Initial commit

2026-05-11 08:20:05 +00:00 · 2025-04-06 14:42:49 +02:00
commit 543c4b2771
186 changed files with 61017 additions and 0 deletions
--- a/MANIFEST.in
+++ b/MANIFEST.in
@@ -0,0 +1 @@
+include exllamav3/util/hadamard_data/*
--- a/README.md
+++ b/README.md
@@ -0,0 +1,112 @@
+
+# <img src="doc/cat.png" width="40"> ExlLlamaV3
+
+This is an **early preview release** of ExLlamaV3. Please note: ↙
+
+- The framework <u>is not yet fully optimized</u>. Performance is lacking, especially on Ampere, and there may be a significant CPU bottleneck on slower processors until the extension functions are fully built out.
+- AMD GPUs (ROCm) are not yet supported.
+- [FlashAttention-2](https://github.com/Dao-AILab/flash-attention) is currently required. I hope to switch over to [FlashInfer](https://github.com/flashinfer-ai/flashinfer/tree/main) in time, but there are some obstacles to overcome first. 
+- A number of important features are yet to be added, such as cache quantization, tensor parallelism and multimodal support.
+- There are no release builds yet.
+- Integration into [TabbyAPI](https://github.com/theroyallab/tabbyAPI/) is planned when all the core functionality is in place.
+
+### Why?
+
+As the name implies, the original intention for ExLlama was to run inference on quantized Llama models. ExLlamaV2 was able to support a number of other architectures by treating every new model as (more or less) a Llama variant with optional features. However, as new models are increasingly moving away from the basic transformer template, this approach is no longer sustainable.  
+
+Additionally, ExLlamaV2 is largely designed to run in a single process and CUDA doesn't like this very much when spreading a workload across multiple GPUs. It's a fundamental design feature in the CUDA runtime, and it has become a major obstacle to tensor-parallel inference, demand for which seems to keep increasing. This shortcoming is not easily addressed without a rewrite. Moreover, the **EXL2** format doesn't lend itself well to parallel inference in the first place due its input channel permutation.
+
+ Aside from lifting a few of the most successful features from V2 (such as the generator), ExLlamaV3 is largely rewritten from scratch to provide a cleaner, more modular framework for supporting newer architectures. It also introduces a new SOTA quantization format based on [**QTIP**](https://github.com/Cornell-RelaxML/qtip) (see below).
+
+### What's missing?
+
+There's much that still needs to be added and/or ported over from ExLlamaV2. I've decided to release ExLlamaV3 in its current state to invite testing, feedback and contributions, but please be aware that it's not yet a viable replacement for ExLlamaV2. Currently on the to-do list:
+
+- Support for more architectures (Mixtral, Cohere and Deepseek are in the works)
+- Samplers (most notably repetition penalties and min-P are missing)
+- Constrained sampling (JSON filters etc.)
+- Multimodal support
+- Cache quantization
+- LoRA support
+- ROCm support
+- Tensor-parallel inference
+- Lots of optimization
+
+As for what is implemented, expect that some things may be a little broken at first. Please be patient and/or contribute. 👉👈 
+
+### How to?
+
+#### Installation
+
+Detailed installation instructions are coming soon, along with prebuilt wheels. For the time being, you can install the library with:
+
+```sh
+# Full installation
+pip install -r requirements.txt
+pip install .
+
+# JIT mode
+EXLLAMA_NOCOMPILE=1 pip install . 
+```
+
+Note that the included scripts can run in JIT mode from the repo directory without installing the library.
+
+#### Conversion
+
+To convert a model to EXL3 format, use:
+
+```sh
+# Convert model
+python convert.py -i <input_dir> -o <output_dir> -w <working_dir> -b <bitrate>
+
+# Resume an interrupted quant job
+convert.py -w <working_dir> -r
+
+# More options
+convert.py -h
+```
+
+The working directory is temporary storage for state checkpoints and for storing quantized tensors until the converted model can be compiled. It should have enough free space to store an entire copy of the output model. Note that while EXL2 conversion by default resumes an interrupted job when pointed to an existing folder, EXL3 needs you to explicitly resume with the `-r`/`--resume` argument.    
+
+#### Examples
+
+A number of example scripts are provided to showcase the features of the backend and generator. Some of them have hardcoded model paths and should be edited before you run them, but there is a simple CLI chatbot that you can start with:
+
+```sh
+python examples/chat.py -m <input_dir> -mode <prompt_mode> 
+
+# E.g.:
+python examples/chat.py -m /mnt/models/llama3.1-8b-instruct-exl3 -mode llama3
+```
+
+### EXL3 quantization
+
+<figure class="image" align="center">
+    <a href="doc/exl3.md" target="_blank">
+        <img src="doc/llama31_8b_instruct_bpw.png" width="800">
+    </a>
+</figure>
+
+Despite their amazing achievements, most SOTA quantization techniques remain cumbersome or even prohibitively expensive to use. For instance, **AQLM** quantization of a 70B model takes around **720 GPU-hours** on an A100 server, costing $850 US at the time of writing. ExLlamaV3 aims to address this with **EXL3** format, which is a streamlined variant of [**QTIP**](https://github.com/Cornell-RelaxML/qtip) from Cornell RelaxML. The conversion process is designed to be simple and efficient and requires only an input model (in HF format) and a target bitrate. By computing Hessians on the fly and thanks to a fused Viterbi kernel, the quantizer can convert a model in a single step, taking a couple of minutes for a smaller models, up to a few hours for larger ones (70B+) (on a single RTX 4090 or equivalent GPU.)
+
+The [Marlin](https://github.com/IST-DASLab/marlin)-inspired GEMM kernel achieves roughly memory-bound latency under optimal conditions (4bpw, RTX 4090), though it still needs some work to achieve the same efficiency on Ampere GPUs and to remain memory-bound at lower bitrates.
+
+Since converted models largely retain the original file structure (unlike **EXL2** which renames some tensors in its quest to turn every model into a Llama variant), it will be possible to extend **EXL3** support to other frameworks like HF Transformers and vLLM.
+
+There are some benchmark results [here](doc/exl3.md), and a full writeup on the format is coming soon.
+
+Fun fact: Llama-3.1-70B-EXL3 is coherent at 1.6 bpw. With the output layer quantized to 3 bpw and a 4096-token cache, inference is possible in under 16 GB of VRAM. 
+
+A selection of EXL3-quantized models is available on [🤗 Hugging Face](https://huggingface.co/turboderp).
+
+
+### Acknowledgements
+
+This project owes its existence to a wonderful community of FOSS developers and some very generous supporters (🐈❤️!) The following projects in particular deserve a special mention:
+
+- [TabbyAPI](https://github.com/theroyallab/tabbyAPI/)
+- [PyTorch](https://github.com/pytorch/pytorch)
+- [FlashAttention](https://github.com/Dao-AILab/flash-attention)
+- [QTIP](https://github.com/Cornell-RelaxML/qtip)
+- [Transformers](https://github.com/huggingface/transformers)
+- [Marlin](https://github.com/IST-DASLab/marlin)
--- a/convert.py
+++ b/convert.py
@@ -0,0 +1,11 @@
+from exllamav3.conversion.convert_model import parser, main, prepare
+
+# Script included in package: ./exllamav3/conversion/convert_model.py
+
+if __name__ == "__main__":
+    _args = parser.parse_args()
+    _in_args, _job_state, _ok, _err = prepare(_args)
+    if not _ok:
+        print(f" !! Error: {_err}")
+    else:
+        main(_in_args, _job_state)
--- a/doc/cat.png
+++ b/doc/cat.png
--- a/doc/exl3.md
+++ b/doc/exl3.md
@@ -0,0 +1,77 @@
+# EXL3 quantization
+
+The new **EXL3** format is a variant of [**QTIP**](https://github.com/Cornell-RelaxML/qtip). Like **QTIP** it uses a procedural codebook and encodes high-dimensional vectors into optimal tail-biting trellis structures, but it deviates from **QTIP** in how tensors are regularized and packed. A full description of the format is coming, but until then I refer to the code for the [quantizer](../exllamav3/modules/quant/exl3_lib/quantize.py) and associated [kernels](../exllamav3/exllamav3_ext/quant), the [**QTIP**](https://arxiv.org/abs/2406.11235) and [**QuIP#**](https://arxiv.org/abs/2402.04396) papers, as well as this [excellent writeup](https://www.together.ai/blog/even-better-even-faster-quantized-llms-with-qtip) on **QTIP** from together.ai.
+
+It turns out to be difficult to collect enough examples of models converted with the various SOTA (or SOTA-adjacent) methods. I attribute the lack of options largely to how difficult it is to work with these formats in the first place, hence this project. Following are some benchmarks and comparisons to other formats I was able to find samples of. A couple of notes:
+
+- I have not yet been able to make regular **QTIP** inference work (go figure) but it's probably safe to assume it would match or outperform **EXL3** in accuracy, being largely the same method except with more options.
+- Accounting for quantization of the output layer can make a huge difference in practice, especially for smaller models. So I am including two versions of each perplexity graph, one with bitrate on the horizontal axis, and one that measures the entire VRAM footprint of the weights (not counting the embedding layer which for most inference tasks can be relegated to system RAM.)
+- **GGUF** i-quants are abundant, and it's worth noting that they hold up well in comparison to SOTA formats.
+
+### Perplexity tests
+
+The [eval/compare_q.py](../eval/compare_q.py) script makes an apples-to-apples comparison between formats, measuring perplexity on the wiki2 test set across available bitrates while ensuring that tokenization and scoring remains consistent throughout.
+
+<p align="center">
+<b>Llama 3.1 8B Instruct</b>
+</p>
+
+<figure class="image" align="center">
+  <a href="llama31_8b_instruct_bpw.png" target="_blank">
+    <img src="llama31_8b_instruct_bpw.png" width="400">
+  </a>
+  <a href="llama31_8b_instruct_vram.png" target="_blank">
+    <img src="llama31_8b_instruct_vram.png" width="400">
+  </a>
+</figure>
+
+<p align="center">
+<b>Llama 3.1 70B Instruct</b>
+</p>
+
+<figure class="image" align="center">
+  <a href="llama31_70b_instruct_bpw.png" target="_blank">
+    <img src="llama31_70b_instruct_bpw.png" width="400">
+  </a>
+  <a href="llama31_70b_instruct_vram.png" target="_blank">
+    <img src="llama31_70b_instruct_vram.png" width="400">
+  </a>
+</figure>
+
+<p align="center">
+<b>Llama 3.2 1B Instruct</b>
+</p>
+
+<figure class="image" align="center">
+  <a href="llama32_1b_instruct_bpw.png" target="_blank">
+    <img src="llama32_1b_instruct_bpw.png" width="400">
+  </a>
+  <a href="llama32_1b_instruct_vram.png" target="_blank">
+    <img src="llama32_1b_instruct_vram.png" width="400">
+  </a>
+</figure>
+
+<p align="center">
+<b>Mistral 7B Instruct v0.3</b>
+</p>
+
+<figure class="image" align="center">
+  <a href="mistral_7b_instruct_v0.3_bpw.png" target="_blank">
+    <img src="mistral_7b_instruct_v0.3_bpw.png" width="400">
+  </a>
+  <a href="mistral_7b_instruct_v0.3_vram.png" target="_blank">
+    <img src="mistral_7b_instruct_v0.3_vram.png" width="400">
+  </a>
+</figure>
+
+### HumanEval
+
+For the models tested here, HumanEval scores align closely with results advertised by the publishers or collected from other sources. Some deviation is to be expected due to differences in prompting and sampling, as well as random variation. See the [eval/humaneval.py](../eval/humaneval.py) script for specifics. The occasional bump around 3 bpw is repeatable and statistically significant, likely worth investigating. 
+
+<figure class="image" align="center">
+  <img src="humaneval.png" width="800">
+</figure>
+
+### Further work
+
+More evaluations are underway (MMLU, MMLU-Pro, etc.), and more models will be tested as architectures are added.
--- a/doc/gumbel_eval.png
+++ b/doc/gumbel_eval.png
--- a/doc/humaneval.png
+++ b/doc/humaneval.png
--- a/doc/llama31_70b_instruct_bpw.png
+++ b/doc/llama31_70b_instruct_bpw.png
--- a/doc/llama31_70b_instruct_vram.png
+++ b/doc/llama31_70b_instruct_vram.png
--- a/doc/llama31_8b_instruct_bpw.png
+++ b/doc/llama31_8b_instruct_bpw.png
--- a/doc/llama31_8b_instruct_vram.png
+++ b/doc/llama31_8b_instruct_vram.png
--- a/doc/llama32_1b_instruct_bpw.png
+++ b/doc/llama32_1b_instruct_bpw.png
--- a/doc/llama32_1b_instruct_vram.png
+++ b/doc/llama32_1b_instruct_vram.png
--- a/doc/mistral_7b_instruct_v0.3_bpw.png
+++ b/doc/mistral_7b_instruct_v0.3_bpw.png
--- a/doc/mistral_7b_instruct_v0.3_vram.png
+++ b/doc/mistral_7b_instruct_v0.3_vram.png
--- a/doc/procedural_codebook.png
+++ b/doc/procedural_codebook.png
--- a/eval/compare_q.py
+++ b/eval/compare_q.py
@@ -0,0 +1,264 @@
+import sys, os
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+import torch
+import torch.nn.functional as F
+from exllamav3.util.file import disk_lru_cache, disk_lru_cache_clear
+from exllamav3.util.progress import ProgressBar
+from exllamav3.util.memory import free_mem
+from datasets import load_dataset
+import math
+import argparse
+import json
+import matplotlib.pyplot as plt
+from adjustText import adjust_text
+import glob
+
+torch.set_printoptions(precision = 5, sci_mode = False, linewidth = 200)
+
+# Lookup tables to ensure test functions are cacheable
+
+from compare_q_transformers import (
+    load_transformers_auto,
+    load_transformers,
+    fwd_transformers,
+    tokenize_transformers
+)
+from compare_q_exllamav2 import (
+    load_exllamav2,
+    fwd_exllamav2
+)
+from compare_q_exllamav3 import (
+    load_exllamav3,
+    fwd_exllamav3
+)
+from compare_q_llamacpp import (
+    load_llamacpp,
+    fwd_llamacpp
+)
+
+load_fns = {
+    "transformers_auto": load_transformers_auto,
+    "transformers": load_transformers,
+    "exllamav2": load_exllamav2,
+    "exllamav3": load_exllamav3,
+    "llamacpp": load_llamacpp,
+}
+
+fwd_fns = {
+    "transformers": fwd_transformers,
+    "exllamav2": fwd_exllamav2,
+    "exllamav3": fwd_exllamav3,
+    "llamacpp": fwd_llamacpp,
+}
+
+tokenize_fns = {
+    "transformers": tokenize_transformers,
+}
+
+# Tokenize ppl test data
+
+@disk_lru_cache("get_dataset")
+def get_test_data(spec: dict):
+    tokenize_fn = tokenize_fns[spec["tokenize_fn"]]
+    assert spec["dataset"] == "wiki2", "Only wiki2 implemented atm"
+    eval_stride = spec["eval_stride"]
+    eval_len = spec["eval_len"]
+    max_rows = spec.get("max_rows", 0)
+    eval_tokens = tokenize_fn(
+        spec["tokenizer_dir"],
+        "\n\n".join(
+            load_dataset("wikitext", "wikitext-2-raw-v1", split = "test")
+            ["text"]
+        )
+    )
+    num_tokens = eval_tokens.shape[-1]
+    seqs = []
+    for a in range(0, num_tokens - eval_len, eval_stride):
+        b = a + eval_len
+        seqs.append(eval_tokens[:, a:b])
+        if max_rows and len(seqs) >= max_rows:
+            break
+    eval_tokens = torch.cat(seqs, dim = 0)[:, :]
+    return eval_tokens
+
+# Run ppl test
+
+@disk_lru_cache("test_ppl")
+def test_ppl(data_spec: dict, spec: dict):
+    load_fn = load_fns[spec["load_fn"]]
+    fwd_fn = fwd_fns[spec["fwd_fn"]]
+    model_dir = spec["model_dir"]
+
+    print(f"Loading dataset: {data_spec['dataset']}")
+    eval_ids = get_test_data(data_spec)
+    rows = eval_ids.shape[0]
+
+    print(f"Loading: {model_dir}")
+    model_instance, bpw_layer, bpw_head, vram_bits = load_fn(model_dir)
+    vram_gb = vram_bits / 8 / 1024**3
+
+    logprob_sum = 0.0
+    logprob_count = 0
+
+    print(f"Testing: {model_dir} ({spec['label']})")
+
+    with ProgressBar("Evaluating", rows) as pb:
+        for row in range(rows):
+            pb.update(row)
+            input_ids = eval_ids[row:row + 1, :]
+            logits = fwd_fn(model_instance, input_ids)
+            logits = logits[:, :-1, :].float() + 1e-10
+            log_probs = F.log_softmax(logits, dim = -1)
+            del logits
+            target_ids = input_ids[:, 1:].to(log_probs.device)
+            target_log_probs = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
+            del log_probs
+            logprob_sum += target_log_probs.sum().item()
+            logprob_count += target_ids.numel()
+            del target_log_probs
+            del target_ids
+        mean_log_prob = logprob_sum / logprob_count
+        perplexity = math.exp(-mean_log_prob)
+
+    print(f"Perplexity: {perplexity:.6f}")
+
+    del model_instance
+    del eval_ids
+
+    free_mem()
+    return {
+        "label": spec.get("label", spec.get("model_dir")),
+        "layer_bpw": bpw_layer,
+        "head_bpw": bpw_head,
+        "vram_gb": vram_gb,
+        "ppl": perplexity,
+    }
+
+
+def plot(results, args):
+
+    def get_color(s):
+        d = {
+            "EXL2": "green",
+            "EXL3": "purple",
+            "AWQ": "olive",
+            "imat": "brown",
+            "GGUF": "red",
+            "VPTQ": "blue",
+        }
+        for k, v in d.items():
+            if k in s:
+                return v
+        return "black"
+
+    plt.rcParams["figure.figsize"] = (14, 11)
+    plt.subplots_adjust(left = 0.05, right = 0.95, top = 0.95, bottom = 0.05)
+
+    lpoints = {}
+    x = []
+    y = []
+    labels = []
+    colors = []
+    for r in results:
+        x_ = r["vram_gb"] if args.vram else r["layer_bpw"]
+        y_ = r["ppl"]
+        if x_ > args.max_x or y_ > args.max_y:
+            continue
+        x.append(x_)
+        y.append(y_)
+        labels.append(r["label"] + f"\n{y_:.3f}")
+        color = get_color(r["label"])
+        colors.append(color)
+        if color != "black":
+            if color not in lpoints:
+                lpoints[color] = []
+            lpoints[color].append((x_, y_))
+
+    plt.scatter(x, y, c = colors, marker = "o")
+
+    texts = []
+    for i, label in enumerate(labels):
+        texts.append(
+            plt.text(
+                x[i],
+                y[i],
+                label,
+                fontsize = 8.5,
+                ha = "left",
+                va = "bottom",
+                color = colors[i],
+            )
+        )
+    adjust_text(
+        texts,
+        x = x,
+        y = y,
+        arrowprops = {"arrowstyle": "->", "color": "lightgray"},
+        expand = (1.35, 2.3),
+        ensure_inside_axes = True,
+        min_arrow_len = 0.10,
+        prevent_crossings = False,
+        pull_threshold = 0.20,
+        # force_explode = (0.2, 0.6),
+        max_move = 100
+    )
+
+    for col, lines in lpoints.items():
+        x, y = zip(*sorted(lines))
+        plt.plot(x, y, color = col, linestyle=':')
+
+    plt.xlabel("VRAM // GB (decoder + head)" if args.vram else "bits per weight (decoder only)")
+    plt.ylabel("Perplexity")
+    plt.title(args.title)
+    plt.grid(True)
+    plt.show()
+
+
+def main(args):
+    with open(args.dataspec, "r", encoding = "utf8") as f:
+        test_data_spec = json.load(f)
+
+    models_files = args.modelspec
+    models_files_g = []
+    models_spec = []
+    for filename in models_files:
+        if "*" in filename:
+            models_files_g += glob.glob(filename)
+        else:
+            models_files_g.append(filename)
+    for filename in models_files_g:
+        with open(filename, "r", encoding = "utf8") as f:
+            m = json.load(f)
+            models_spec += m
+
+    if args.clear_cache:
+        for spec in models_spec:
+            disk_lru_cache_clear("test_ppl", test_data_spec, spec)
+
+    results = []
+    for spec in models_spec:
+        r = test_ppl(test_data_spec, spec)
+        print(r)
+        results.append(r)
+
+    print("------")
+    print(json.dumps(results, indent = 4))
+
+    if args.plot:
+        plot(results, args)
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-d", "--dataspec", type = str, help = "Data specification (JSON file)")
+    parser.add_argument("-m", "--modelspec", type = str, nargs="+", help = "Model specification (JSONL file), accepts wildcard")
+    parser.add_argument("-cc", "--clear_cache", action = "store_true", help = "Clear cache")
+    parser.add_argument("-p", "--plot", action = "store_true", help = "Scatter plot")
+    parser.add_argument("-v", "--vram", action = "store_true", help = "Use VRAM footprint as scatter plot X axis")
+    parser.add_argument("-mx", "--max_x", type = float, default = 999999, help = "Don't plot results beyond X value")
+    parser.add_argument("-my", "--max_y", type = float, default = 999999, help = "Don't plot results beyond Y value")
+    parser.add_argument("-t", "--title", type = str, default = "Very plot", help = "Plot title")
+    _args = parser.parse_args()
+    main(_args)
+
+
--- a/eval/compare_q_exllamav2.py
+++ b/eval/compare_q_exllamav2.py
@@ -0,0 +1,34 @@
+import torch
+from exllamav2 import ExLlamaV2, ExLlamaV2Config, ExLlamaV2Cache
+from exllamav2.model import ExLlamaV2Linear
+
+def get_tensor_size(tensors):
+    return 8 * sum(t.element_size() * t.numel() for t in tensors.values())
+
+def get_storage_info(model):
+    sum_bits = 0
+    sum_numel = 0
+    head_bpw = 0
+    head_numel = 0
+    for key, module in model.modules_dict.items():
+        if module.key == "lm_head":
+            head_bpw = get_tensor_size(module.q_tensors) / module.numel()
+            head_numel = module.numel()
+        elif isinstance(module, ExLlamaV2Linear):
+            sum_bits += get_tensor_size(module.q_tensors)
+            sum_numel += module.numel()
+    vram_bits = head_numel * head_bpw + sum_bits
+    return sum_bits / sum_numel, head_bpw, vram_bits
+
+def load_exllamav2(model_dir: str | list):
+    config = ExLlamaV2Config(model_dir)
+    model = ExLlamaV2(config)
+    cache = ExLlamaV2Cache(model, batch_size = 1, max_seq_len = 2048)  # Cache isn't used but reqd by autosplit
+    model.load_autosplit(cache)
+    bpw_layer, bpw_head, vram_bits = get_storage_info(model)
+    return model, bpw_layer, bpw_head, vram_bits
+
+def fwd_exllamav2(model_instance, input_ids: torch.Tensor):
+    input_ids = input_ids.cpu()
+    output = model_instance.forward(input_ids)
+    return output
--- a/eval/compare_q_exllamav3.py
+++ b/eval/compare_q_exllamav3.py
@@ -0,0 +1,38 @@
+import torch
+from exllamav3 import Config, Model, Tokenizer, Cache
+from exllamav3.modules import Linear
+
+def get_tensor_size(tensors):
+    return 8 * sum(t.element_size() * t.numel() for t in tensors.values())
+
+def get_storage_info(model):
+    sum_bits = 0
+    sum_numel = 0
+    head_bpw = 0
+    head_numel = 0
+    for module in model:
+        if module.key == "lm_head":
+            head_bpw = get_tensor_size(module.get_tensors()) / module.weights_numel()
+            head_numel = module.weights_numel()
+        elif isinstance(module, Linear):
+            sum_bits += get_tensor_size(module.get_tensors())
+            sum_numel += module.weights_numel()
+    vram_bits = head_numel * head_bpw + sum_bits
+    return sum_bits / sum_numel, head_bpw, vram_bits
+
+def load_exllamav3(model_dir: str | list):
+    if isinstance(model_dir, list):
+        model_dir, override_tensors = model_dir
+        config = Config.from_directory(model_dir)
+        config.stc.add_tensor_files(override_tensors)
+    else:
+        config = Config.from_directory(model_dir)
+    model = Model.from_config(config)
+    model.load(max_output_size = 2048, max_output_factor = 3)
+    bpw_layer, bpw_head, vram_bits = get_storage_info(model)
+    return model, bpw_layer, bpw_head, vram_bits
+
+def fwd_exllamav3(model_instance, input_ids: torch.Tensor):
+    input_ids = input_ids.cpu()
+    output = model_instance.forward(input_ids, {"attn_mode": "flash_attn_nc"})
+    return output
--- a/eval/compare_q_llamacpp.py
+++ b/eval/compare_q_llamacpp.py
@@ -0,0 +1,61 @@
+try:
+    import llama_cpp
+    import gguf
+    from gguf import GGUFReader
+    from llama_cpp import Llama
+except:
+    pass
+import torch
+from functools import lru_cache
+from exllamav3.util.file import disk_lru_cache
+
+@lru_cache  # run once
+def init_backend():
+    llama_cpp.llama_backend_init(False)
+
+@disk_lru_cache("lcpp_get_storage_info")
+def get_storage_info(model_dir):
+    reader = GGUFReader(model_dir)
+    tensors = reader.tensors
+    sum_bits = 0
+    sum_numel = 0
+    head_bpw = 0
+    head_numel = 0
+    for tensor_info in tensors:
+        name = tensor_info.name
+        if any(name.endswith(k) for k in [
+            ".ffn_down.weight",
+            ".ffn_gate.weight",
+            ".ffn_up.weight",
+            ".attn_q.weight",
+            ".attn_k.weight",
+            ".attn_v.weight",
+            ".attn_output.weight",
+        ]):
+            sum_bits += tensor_info.n_bytes * 8
+            sum_numel += tensor_info.n_elements
+        if (name == "token_embd.weight" and head_bpw == 0) or \
+            name == "output.weight":
+            head_bpw = tensor_info.n_bytes * 8 / tensor_info.n_elements
+            head_numel = tensor_info.n_elements
+    vram_bits = head_numel * head_bpw + sum_bits
+    return sum_bits / sum_numel, head_bpw, vram_bits
+
+def load_llamacpp(model_dir: str):
+    init_backend()
+    model = Llama(
+        model_path = model_dir,
+        logits_all = True,
+        verbose = False,
+        n_ctx = 2048,
+        n_gpu_layers = 999
+    )
+    bpw_layer, bpw_head, vram_bits = get_storage_info(model_dir)
+    return model, bpw_layer, bpw_head, vram_bits
+
+def fwd_llamacpp(model_instance, input_ids: torch.Tensor):
+    input_ids_list = input_ids[0].tolist()
+    model_instance.reset()
+    model_instance.eval(input_ids_list)
+    logits = torch.from_numpy(model_instance.scores).unsqueeze(0).cuda()
+    return logits
--- a/eval/compare_q_transformers.py
+++ b/eval/compare_q_transformers.py
@@ -0,0 +1,106 @@
+import torch
+from transformers import AutoTokenizer, AutoModelForCausalLM
+from aqlm import QuantizedLinear
+from awq.modules.linear import WQLinear_GEMM
+from vptq import VQuantLinear
+from bitsandbytes.nn import Linear4bit
+
+def get_tensors_size(tensors):
+    return 8 * sum(t.element_size() * t.numel() for t in tensors.values() if t is not None)
+
+def get_tensor_size(tensor):
+    return 8 * tensor.element_size() * tensor.numel()
+
+def scan_gpu_tensors(obj, seen = None):
+    if seen is None:
+        seen = set()
+    obj_id = id(obj)
+    if obj_id in seen:
+        return 0
+    seen.add(obj_id)
+    total_size = 0
+    # If it's a GPU tensor, add its memory usage.
+    if isinstance(obj, torch.Tensor) and obj.is_cuda:
+        total_size += obj.element_size() * obj.nelement()
+    else:
+        if isinstance(obj, dict):
+            for key, value in obj.items():
+                total_size += scan_gpu_tensors(key, seen)
+                total_size += scan_gpu_tensors(value, seen)
+            return total_size
+        if isinstance(obj, (list, tuple, set)):
+            for item in obj:
+                total_size += scan_gpu_tensors(item, seen)
+            return total_size
+        if hasattr(obj, '__dict__'):
+            total_size += scan_gpu_tensors(vars(obj), seen)
+        if hasattr(obj, '__slots__'):
+            for slot in obj.__slots__:
+                try:
+                    attr = getattr(obj, slot)
+                    total_size += scan_gpu_tensors(attr, seen)
+                except AttributeError:
+                    continue
+    return total_size
+
+def get_storage_info(model):
+    sum_bits = 0
+    sum_numel = 0
+    head_bpw = 0
+    head_numel = 0
+    for name, module in model.named_modules():
+        if any(isinstance(module, x) for x in [Linear4bit]):
+            if module.out_features >= model.vocab_size * 0.9:  # this is foolproof
+                head_numel = module.in_features * module.out_features
+                head_bpw = module.weight.numel() * 8
+                head_bpw = (head_bpw + scan_gpu_tensors(module.quant_state) * 8) / head_numel
+            else:
+                sum_bits += module.weight.numel() * 8
+                sum_bits += scan_gpu_tensors(module.quant_state) * 8
+                sum_numel += module.in_features * module.out_features
+        elif any(isinstance(module, x) for x in [torch.nn.Linear]):
+            if module.out_features >= model.vocab_size * 0.9:
+                head_bpw = module.weight.element_size() * 8
+                head_numel = module.weight.numel()
+            else:
+                sum_bits += get_tensor_size(module.weight)
+                sum_numel +=  module.weight.numel()
+        elif any(isinstance(module, x) for x in [QuantizedLinear, VQuantLinear]):
+            sum_bits += get_tensors_size(dict(module.named_parameters()))
+            sum_numel += module.in_features * module.out_features
+        elif any(isinstance(module, x) for x in [WQLinear_GEMM]):
+            sum_bits += get_tensors_size({
+                "qweight": module.qweight,
+                "qzeros": module.qzeros,
+                "scales": module.scales,
+            })
+            sum_numel += module.in_features * module.out_features
+
+    vram_bits = head_numel * head_bpw + sum_bits
+    return sum_bits / sum_numel, head_bpw, vram_bits
+
+@torch.inference_mode
+def load_transformers(model_dir: str, auto = False):
+    model = AutoModelForCausalLM.from_pretrained(
+        model_dir,
+        device_map = "auto" if auto else "cuda:0",
+        torch_dtype = torch.half
+    )
+    bpw_layer, bpw_head, vram_bits = get_storage_info(model)
+    return model, bpw_layer, bpw_head, vram_bits
+
+@torch.inference_mode
+def load_transformers_auto(model_dir: str):
+    return load_transformers(model_dir, auto = True)
+
+@torch.inference_mode
+def fwd_transformers(model_instance, input_ids: torch.Tensor):
+    input_ids = input_ids.to("cuda:0")
+    output = model_instance(input_ids)
+    return output.logits
+
+@torch.inference_mode
+def tokenize_transformers(tokenizer_dir: str, text: str):
+    tokenizer = AutoTokenizer.from_pretrained(tokenizer_dir)
+    output = tokenizer(text, return_tensors="pt")
+    return output.input_ids
--- a/eval/humaneval.py
+++ b/eval/humaneval.py
@@ -0,0 +1,168 @@
+from __future__ import annotations
+import os, sys
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from exllamav3 import model_init, Generator, Job, TopPSampler
+from exllamav3.util.progress import ProgressBar
+import argparse, contextlib, subprocess
+from human_eval.data import write_jsonl, read_problems
+from pathlib import Path
+
+# Prompt formats
+prompt_formats = {
+    "raw": (
+        "```python\n{{problem}}    ",
+        "    "
+    ),
+    "granite": (
+        "Question:\nComplete the following Python function:\n\n{{problem}}\n\nAnswer:\n"
+        "Sure! Here is how you might implement the function:\n\n```python\n{{problem}}",
+        "    "
+    ),
+    "llama": (
+        "[INST] <<SYS>>\n"
+        "You are a helpful AI coding assistant.\n"
+        "<</SYS>>\n\n"
+        "Complete the following Python function:\n\n"
+        "{{problem}} [/INST] "
+        "Sure! Here is how you might implement the function:\n\n```python\n{{problem}}",
+        "    "
+    ),
+    "llama3": (
+        "<|start_header_id|>system<|end_header_id|>\n\n"
+        "You are a helpful AI coding assistant.<|eot_id|>"
+        "<|start_header_id|>user<|end_header_id|>\n\n"
+        "Complete the following Python function:\n\n{{problem}}<|eot_id|>"
+        "<|start_header_id|>assistant<|end_header_id|>\n\n"
+        "Sure! Here is how you might implement the function:\n\n```python\n{{problem}}",
+        "    "
+    ),
+    "mistral": (
+        "[INST] You are a helpful AI coding assistant.\n\n"
+        "Complete the following Python function:\n\n"
+        "{{problem}}[/INST]"
+        " Sure! Here is how you might implement the function:\n\n```python\n{{problem}}",
+        "    "
+    ),
+    "gemma": (
+        "<bos><start_of_turn>user\n"
+        "Complete the following Python function:\n\n{{problem}}<|eot_id|>"
+        "<start_of_turn>model\n"
+        "```python\n{{problem}}",
+        "    "
+    )
+}
+
+def main(args):
+
+    # Validate args
+    directory = os.path.dirname(args.output)
+    if os.path.exists(args.output):
+        print(f" !! Warning: Output file exists and will be overwritten.")
+
+    if args.prompt_format is None:
+        prompt_format, prefix = "{{problem}}", "    "
+    elif args.prompt_format in prompt_formats:
+        prompt_format, prefix = prompt_formats[args.prompt_format]
+    else:
+        print("Prompt format is not supported. Available formats:")
+        print("\n".join(prompt_formats.keys()))
+        sys.exit()
+
+    # Initialize
+    model, config, cache, tokenizer = model_init.init(args)
+    generator = Generator(
+        model = model,
+        cache = cache,
+        max_batch_size = 256,
+        tokenizer = tokenizer
+    )
+    sampler = TopPSampler(
+        top_p = args.top_p,
+        temperature = args.temperature
+    )
+
+    # Get problems
+    problems = read_problems()
+    num_samples_per_task = args.samples_per_task
+
+    # Create jobs
+    with ProgressBar("Creating sample jobs", len(problems), transient = False) as progress:
+        for idx, (problem_id, problem) in enumerate(problems.items()):
+            b_problem = problem["prompt"]
+            f_problem = prompt_format.replace("{{problem}}", b_problem)
+            input_ids = tokenizer.encode(f_problem, encode_special_tokens = True, add_bos = True)
+            for s in range(num_samples_per_task):
+                job = Job(
+                    input_ids = input_ids,
+                    sampler = sampler,
+                    max_new_tokens = args.max_tokens,
+                    stop_conditions = [tokenizer.eos_token_id],
+                    token_healing = True,
+                    identifier = (problem_id, s),
+                    min_new_tokens = 6
+                )
+                generator.enqueue(job)
+            progress.update(idx)
+
+    # Collect samples here
+    samples = []
+
+    # Work
+    total_jobs = generator.num_remaining_jobs()
+    with ProgressBar("Generating samples" if not args.verbose else None, total_jobs, transient = False) as progress:
+
+        while generator.num_remaining_jobs():
+            results = generator.iterate()
+            for result in results:
+
+                # End sample if generator says EOS or if there is a non-indented line at the end of the output
+                job = result["job"]
+                eos = False
+                completion = job.full_completion
+                last_newline_index = completion.rfind("\n")
+                if last_newline_index >= 0:
+                    last_line = completion[last_newline_index + 1:]
+                    if last_line != "" and not last_line[0].isspace():
+                        completion = completion[:last_newline_index]
+                        eos = True
+                eos = eos or result["eos"]
+
+                # Collect completed sample
+                if eos:
+                    identifier = result["identifier"]
+                    sample = problems[identifier[0]]["prompt"] + prefix + completion.strip()
+                    if not result["eos"]:
+                        generator.cancel(job)
+
+                    if args.verbose:
+                        print("----------------------------------------------------------------------")
+                        print(f" ** Problem {identifier[0]}, sample {identifier[1] + 1} / {num_samples_per_task}")
+                        print("----------------------------------------------------------------------")
+                        print(sample)
+                        print()
+                    progress.update(total_jobs - generator.num_remaining_jobs())
+                    samples.append(dict(task_id = identifier[0], completion = prefix + completion.strip()))
+
+    # Save output
+    print(f" -- Saving: {args.output}")
+    Path(directory).mkdir(parents = True, exist_ok = True)
+    write_jsonl(args.output, samples)
+
+    # Optionally launch eval script
+    if args.eval:
+        subprocess.run(["evaluate_functional_correctness", args.output])
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser(description = "Run HumanEval evaluation")
+    model_init.add_args(parser)
+    parser.add_argument("-o", "--output", type = str, help = "Output .jsonl filename", required = True)
+    parser.add_argument("-spt", "--samples_per_task", type = int, default = 200)
+    parser.add_argument("-pf", "--prompt_format", type = str, help = "Instruct format to apply. Default is raw completion (for base models) ")
+    parser.add_argument("-v", "--verbose", action = "store_true", help = "Spam completions to console while generating")
+    parser.add_argument("-e", "--eval", action = "store_true", help = "Run evaluation script on output file after sampling")
+    parser.add_argument("-temp", "--temperature", type = float, help = "Sampling temperature (0 for greedy), default: 0.6", default = 0.6)
+    parser.add_argument("-topp", "--top_p", type = float, help = "Top-p sampling, default: 0.6", default = 0.6)
+    parser.add_argument("--max_tokens", type = int, default = 768, help = "Max number of tokens for each completion")
+    _args = parser.parse_args()
+    main(_args)
--- a/eval/model_diff.py
+++ b/eval/model_diff.py
@@ -0,0 +1,95 @@
+import sys, os
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+import argparse
+from exllamav3.util.file import disk_lru_cache, disk_lru_cache_clear
+from exllamav3.util.progress import ProgressBar
+from exllamav3.util.memory import free_mem
+from exllamav3 import Config, Model, Cache, Tokenizer
+from datasets import load_dataset
+import torch
+import torch.nn.functional as F
+import math
+
+
+@disk_lru_cache("get_dataset_text")
+def get_dataset_text(spec: dict):
+    assert spec["dataset"] == "wiki2", "Only wiki2 implemented atm"
+    dataset_text = "\n\n".join(
+        load_dataset("wikitext", "wikitext-2-raw-v1", split = "test")
+        ["text"]
+    )
+    return dataset_text
+
+
+def get_test_tokens(tokenizer, rows, eval_len = 2048, eval_stride = 512):
+    with ProgressBar("Tokenizing", rows) as pb:
+        dataset_spec = { "dataset": "wiki2" }
+        eval_tokens = tokenizer.encode(get_dataset_text(dataset_spec))
+        num_tokens = eval_tokens.shape[-1]
+        seqs = []
+        for a in range(0, num_tokens - eval_len, eval_stride):
+            b = a + eval_len
+            seqs.append(eval_tokens[:, a:b])
+            pb.update(len(seqs))
+            if len(seqs) >= rows:
+                break
+    return torch.cat(seqs, dim = 0)[:, :]
+
+
+@torch.inference_mode()
+def main(args):
+
+    config_a = Config.from_directory(args.model_a)
+    config_a.override_dynamic_seq_len(2048)
+    tokenizer = Tokenizer.from_config(config_a)
+    model_a = Model.from_config(config_a)
+
+    config_b = Config.from_directory(args.model_b)
+    config_b.override_dynamic_seq_len(2048)
+    model_b = Model.from_config(config_b)
+
+    # Dataset
+    eval_ids = get_test_tokens(tokenizer, args.rows)
+    state_a = eval_ids
+    state_b = eval_ids
+
+    for idx, (module_a, module_b) in enumerate(zip(model_a.modules, model_b.modules)):
+
+        module_a.load("cuda:0" if not module_a.caps.get("prefer_cpu") else "cpu")
+        params_a = {}
+        state_a = module_a.prepare_for_device(state_a, params_a)
+        state_a = module_a.forward(state_a, params_a)
+        module_a.unload()
+        free_mem()
+
+        module_b.load("cuda:0" if not module_b.caps.get("prefer_cpu") else "cpu")
+        params_b = {}
+        state_b = module_b.prepare_for_device(state_b, params_b)
+        state_b = module_b.forward(state_b, params_b)
+        module_b.unload()
+        free_mem()
+
+        max_diff = 0
+        rfn_error_sum = 0
+        rows = state_a.shape[0]
+        for i in range(rows):
+            sa = state_a[i].to(float, copy = True)
+            sb = state_b[i].to(float)
+            sa -= sb
+            rfn_error_sum += (torch.linalg.norm(sa, 'fro') / torch.linalg.norm(sb, 'fro').mean()).item()
+            sa.abs_()
+            md = ((sa.max().item()) / torch.linalg.norm(sb, 'fro').mean()).item()
+            max_diff = max(max_diff, md)
+
+        rfn_error = rfn_error_sum / rows
+        print(f" -- {module_a.key:40}   error: {rfn_error:.6f}   max_diff/norm: {max_diff:.6f}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    parser.add_argument("-ma", "--model_a", type = str, help = "Model A", required = True)
+    parser.add_argument("-mb", "--model_b", type = str, help = "Model B", required = True)
+    parser.add_argument("-r", "--rows", type = int, help = "Number of rows", default = 100)
+    _args = parser.parse_args()
+    main(_args)
--- a/eval/ppl.py
+++ b/eval/ppl.py
@@ -0,0 +1,92 @@
+import sys, os
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+import argparse
+from exllamav3.util.file import disk_lru_cache, disk_lru_cache_clear
+from exllamav3.util.progress import ProgressBar
+from exllamav3.util.memory import free_mem
+from exllamav3 import Config, Model, Cache, Tokenizer, model_init
+from datasets import load_dataset
+import torch
+import torch.nn.functional as F
+import math
+
+
+@disk_lru_cache("get_dataset_text")
+def get_dataset_text(spec: dict):
+    assert spec["dataset"] == "wiki2", "Only wiki2 implemented atm"
+    dataset_text = "\n\n".join(
+        load_dataset("wikitext", "wikitext-2-raw-v1", split = "test")
+        ["text"]
+    )
+    return dataset_text
+
+
+def get_test_tokens(tokenizer, rows, eval_len = 2048, eval_stride = 512):
+    with ProgressBar("Tokenizing", rows) as pb:
+        dataset_spec = { "dataset": "wiki2" }
+        eval_tokens = tokenizer.encode(get_dataset_text(dataset_spec))
+        num_tokens = eval_tokens.shape[-1]
+        seqs = []
+        for a in range(0, num_tokens - eval_len, eval_stride):
+            b = a + eval_len
+            seqs.append(eval_tokens[:, a:b])
+            pb.update(len(seqs))
+            if len(seqs) >= rows:
+                break
+    return torch.cat(seqs, dim = 0)[:, :]
+
+
+@torch.inference_mode()
+def main(args):
+
+    # Load model
+    # TODO: inplace softmax, reduce max_output_factor to 3
+    model, config, _, tokenizer = model_init.init(
+        args,
+        override_dynamic_seq_len = 2048,
+        max_output_size = 2048,
+        max_output_factor = 5,
+    )
+
+    vocab_size = tokenizer.actual_vocab_size
+    bpw_layer, bpw_head, vram_bits = model.get_storage_info()
+
+    # Dataset
+    eval_ids = get_test_tokens(tokenizer, args.rows)
+
+    # Test
+    logprob_sum = 0.0
+    logprob_count = 0
+    with ProgressBar("Evaluating", args.rows) as pb:
+        for row in range(eval_ids.shape[0]):
+            pb.update(row)
+            input_ids = eval_ids[row:row + 1, :]
+            logits = model.forward(input_ids, {"attn_mode": "flash_attn_nc"})
+            logits = logits[:, :-1, :vocab_size].float() + 1e-10
+            log_probs = F.log_softmax(logits, dim = -1)
+            del logits
+            target_ids = input_ids[:, 1:].to(log_probs.device)
+            del input_ids
+            target_log_probs = log_probs.gather(-1, target_ids.unsqueeze(-1)).squeeze(-1)
+            logprob_sum += target_log_probs.sum().item()
+            logprob_count += target_ids.numel()
+            del target_log_probs
+            del target_ids
+            torch.cuda.empty_cache()
+        pb.update(args.rows)
+        mean_log_prob = logprob_sum / logprob_count
+        perplexity = math.exp(-mean_log_prob)
+
+    print(f" -- Model: {args.model_dir}")
+    print(f" -- Bitrate: {bpw_layer:.2f} bpw / {bpw_head:.2f} bpw (head)")
+    print(f" -- Evaluated: {eval_ids.shape[0]} rows of {eval_ids.shape[1]} tokens")
+    print(f" -- Perplexity: {perplexity:.6f}")
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    model_init.add_args(parser, cache = False)
+    parser.add_argument("-r", "--rows", type = int, help = "Number of rows", default = 100)
+    _args = parser.parse_args()
+    main(_args)
--- a/eval/spec/llama3.1-70b-instruct_aqlm.json
+++ b/eval/spec/llama3.1-70b-instruct_aqlm.json
@@ -0,0 +1,8 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/aqlm/2bit-1x16-pv-g1/",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "AQLM 2bit 1x16-g1"
+    }
+]
--- a/eval/spec/llama3.1-70b-instruct_awq.json
+++ b/eval/spec/llama3.1-70b-instruct_awq.json
@@ -0,0 +1,8 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/awq/4bit",
+        "load_fn": "transformers_auto",
+        "fwd_fn": "transformers",
+        "label": "AWQ 4bit"
+    }
+]
--- a/eval/spec/llama3.1-70b-instruct_exl2.json
+++ b/eval/spec/llama3.1-70b-instruct_exl2.json
@@ -0,0 +1,44 @@
+[
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 2.4bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl2/2.4bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 2.8bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl2/2.8bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 3.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl2/3.5bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 4.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl2/4.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 4.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl2/4.5bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 5.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl2/5.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 6.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl2/6.0bpw/"
+    }
+]
--- a/eval/spec/llama3.1-70b-instruct_exl3.json
+++ b/eval/spec/llama3.1-70b-instruct_exl3.json
@@ -0,0 +1,76 @@
+[
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 1.6bpw H3",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3_x/1.6bpw_H3/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 1.6bpw H3",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3/1.6bpw_H3/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 1.8bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3/1.8bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3/2.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.25bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3/2.25bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3/2.5bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 3.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3/3.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 3.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3/3.5bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 4.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3/4.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 4.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3/4.5bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 5.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3/5.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 6.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/exl3/6.0bpw/"
+    }
+]
+
+
--- a/eval/spec/llama3.1-70b-instruct_gguf.json
+++ b/eval/spec/llama3.1-70b-instruct_gguf.json
@@ -0,0 +1,62 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/gguf/llama-3.1-70b-instruct-iq1_m.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ1_M imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/gguf/llama-3.1-70b-instruct-iq2_s.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ2_S imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/gguf/llama-3.1-70b-instruct-iq2_xxs.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ2_XXS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/gguf/Meta-Llama-3.1-70B-Instruct.i1-IQ1_S.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ1_S imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/gguf/Meta-Llama-3.1-70B-Instruct.i1-IQ3_M.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ3_M imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/gguf/Meta-Llama-3.1-70B-Instruct.i1-IQ3_S.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ3_S imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/gguf/Meta-Llama-3.1-70B-Instruct.i1-IQ3_XXS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ3_XXS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/gguf/Meta-Llama-3.1-70B-Instruct.i1-IQ4_XS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ4_XS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/gguf/Meta-Llama-3.1-70B-Instruct.i1-Q4_K_M.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF Q4_K_M imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/gguf/Meta-Llama-3.1-70B-Instruct.i1-Q5_K_M.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF Q5_K_M imat"
+    }
+]
--- a/eval/spec/llama3.1-70b-instruct_vptq.json
+++ b/eval/spec/llama3.1-70b-instruct_vptq.json
@@ -0,0 +1,14 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/vptq/v8-k65536-0-woft",
+        "load_fn": "transformers_auto",
+        "fwd_fn": "transformers",
+        "label": "VPTQ v8-k65536-0-woft"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-70b-instruct/vptq/v16-k65536-32768-woft",
+        "load_fn": "transformers_auto",
+        "fwd_fn": "transformers",
+        "label": "VPTQ v16-k65536-32768-woft"
+    }
+]
--- a/eval/spec/llama3.1-8b-instruct_aqlm.json
+++ b/eval/spec/llama3.1-8b-instruct_aqlm.json
@@ -0,0 +1,20 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/aqlm/2bit-1x16-g8",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "AQLM 2bit 1x16-g8"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/aqlm/2bit-1x16-g8-pv",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "AQLM 2bit 1x16-g8-pv"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/aqlm/2bit-2x8-g8-pv",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "AQLM 2bit 2x8-g8-pv"
+    }
+]
--- a/eval/spec/llama3.1-8b-instruct_exl2.json
+++ b/eval/spec/llama3.1-8b-instruct_exl2.json
@@ -0,0 +1,44 @@
+[
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 3.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl2/3.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 3.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl2/3.5bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 4.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl2/4.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 4.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl2/4.5bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 5.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl2/5.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 6.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl2/6.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 8.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl2/8.0bpw/"
+    }
+]
--- a/eval/spec/llama3.1-8b-instruct_exl3.json
+++ b/eval/spec/llama3.1-8b-instruct_exl3.json
@@ -0,0 +1,74 @@
+[
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 1.7bpw H3",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/1.7bpw_H3/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 1.8bpw H3",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/1.8bpw_H3/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 1.9bpw H3",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/1.9bpw_H3/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/2.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.25bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/2.25bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/2.5bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 3.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/3.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 3.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/3.5bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 4.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/4.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 5.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/5.0bpw/"
+    },
+        {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 6.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/6.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 8.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/8.0bpw/"
+    }
+]
--- a/eval/spec/llama3.1-8b-instruct_gguf.json
+++ b/eval/spec/llama3.1-8b-instruct_gguf.json
@@ -0,0 +1,74 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-IQ1_M.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ1_M imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-IQ2_M.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ2_M imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-IQ2_S.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ2_S imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-IQ2_XS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ2_XS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-IQ2_XXS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ2_XXS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-IQ3_M.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ3_M imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-IQ3_S.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ3_S imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-IQ3_XS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ3_XS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-IQ4_XS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ4_XS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-Q4_K_M.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF Q4_K_M imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-Q5_K_M.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF Q5_K_M imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/gguf/Meta-Llama-3.1-8B-Instruct.i1-Q6_K.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF Q6_K imat"
+    }
+]
--- a/eval/spec/llama3.1-8b-instruct_hf.json
+++ b/eval/spec/llama3.1-8b-instruct_hf.json
@@ -0,0 +1,8 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/hf",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "HF FP16"
+    }
+]
--- a/eval/spec/llama3.1-8b-instruct_vptq.json
+++ b/eval/spec/llama3.1-8b-instruct_vptq.json
@@ -0,0 +1,26 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/vptq/v8-k65536-256-woft",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "VPTQ v8-k65536-256-woft"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/vptq/v8-k65536-4096-woft",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "VPTQ v8-k65536-4096-woft"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/vptq/v8-k65536-65536-woft",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "VPTQ v8-k65536-65536-woft"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.1-8b-instruct/vptq/v12-k65536-4096-woft",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "VPTQ v12-k65536-4096-woft"
+    }
+]
--- a/eval/spec/llama3.2-1b-instruct_aqlm.json
+++ b/eval/spec/llama3.2-1b-instruct_aqlm.json
@@ -0,0 +1,8 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/aqlm/2bit-2x8-g1-pv",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "AQLM 2bit 2x8-g1-pv"
+    }
+]
--- a/eval/spec/llama3.2-1b-instruct_awq.json
+++ b/eval/spec/llama3.2-1b-instruct_awq.json
@@ -0,0 +1,8 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/awq/4bit",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "AWQ 4-bit"
+    }
+]
--- a/eval/spec/llama3.2-1b-instruct_bnb.json
+++ b/eval/spec/llama3.2-1b-instruct_bnb.json
@@ -0,0 +1,8 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/bnb/bnb-4bit/",
+        "load_fn": "transformers",
+        "fwd_fn": "transformers",
+        "label": "BNB 4-bit"
+    }
+]
--- a/eval/spec/llama3.2-1b-instruct_exl2.json
+++ b/eval/spec/llama3.2-1b-instruct_exl2.json
@@ -0,0 +1,50 @@
+[
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 2.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl2/2.5bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 3.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl2/3.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 3.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl2/3.5bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 4.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl2/4.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 4.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl2/4.5bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 5.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl2/5.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 6.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl2/6.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 8.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl2/8.0bpw/"
+    }
+]
--- a/eval/spec/llama3.2-1b-instruct_exl3.json
+++ b/eval/spec/llama3.2-1b-instruct_exl3.json
@@ -0,0 +1,62 @@
+[
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.0bpw H3",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl3/2.0bpw_H3/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl3/2.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.25bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl3/2.25bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl3/2.5bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 3.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl3/3.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 3.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl3/3.5bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 4.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl3/4.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 5.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl3/5.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 6.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl3/6.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 8.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/exl3/8.0bpw/"
+    }
+]
--- a/eval/spec/llama3.2-1b-instruct_gguf.json
+++ b/eval/spec/llama3.2-1b-instruct_gguf.json
@@ -0,0 +1,53 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/gguf/Llama-3.2-1B-Instruct.i1-IQ2_M.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ2_M imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/gguf/Llama-3.2-1B-Instruct.i1-IQ2_XS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ2_XS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/gguf/Llama-3.2-1B-Instruct.i1-IQ2_XXS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ2_XXS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/gguf/Llama-3.2-1B-Instruct.i1-IQ3_M.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ3_M imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/gguf/Llama-3.2-1B-Instruct.i1-IQ3_XS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ3_XS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/gguf/Llama-3.2-1B-Instruct.i1-IQ4_XS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ4_XS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/gguf/Llama-3.2-1B-Instruct.i1-Q5_K_S.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF Q5_K_S imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/llama3.2-1b-instruct/gguf/Llama-3.2-1B-Instruct.i1-Q6_K.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF Q6_K imat"
+    }
+]
+
+
+
--- a/eval/spec/mistral-7b-instruct-v0.3_awq.json
+++ b/eval/spec/mistral-7b-instruct-v0.3_awq.json
@@ -0,0 +1,8 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/awq/4bit/",
+        "load_fn": "transformers_auto",
+        "fwd_fn": "transformers",
+        "label": "AWQ 4bit"
+    }
+]
--- a/eval/spec/mistral-7b-instruct-v0.3_exl2.json
+++ b/eval/spec/mistral-7b-instruct-v0.3_exl2.json
@@ -0,0 +1,38 @@
+[
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 2.8bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl2/2.8bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 3.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl2/3.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 4.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl2/4.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 4.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl2/4.5bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 5.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl2/5.0bpw/"
+    },
+    {
+        "load_fn": "exllamav2",
+        "fwd_fn": "exllamav2",
+        "label": "EXL2 6.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl2/6.0bpw/"
+    }
+]
--- a/eval/spec/mistral-7b-instruct-v0.3_exl3.json
+++ b/eval/spec/mistral-7b-instruct-v0.3_exl3.json
@@ -0,0 +1,56 @@
+[
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl3/2.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.25bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl3/2.25bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 2.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl3/2.5bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 3.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl3/3.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 3.5bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl3/3.5bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 4.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl3/4.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 5.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl3/5.0bpw/"
+    },
+        {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 6.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl3/6.0bpw/"
+    },
+    {
+        "load_fn": "exllamav3",
+        "fwd_fn": "exllamav3",
+        "label": "EXL3 8.0bpw H6",
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/exl3/8.0bpw/"
+    }
+]
--- a/eval/spec/mistral-7b-instruct-v0.3_gguf.json
+++ b/eval/spec/mistral-7b-instruct-v0.3_gguf.json
@@ -0,0 +1,50 @@
+[
+    {
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/gguf/Mistral-7B-Instruct-v0.3.i1-IQ2_XS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ2_XS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/gguf/Mistral-7B-Instruct-v0.3.i1-IQ2_S.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ2_S imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/gguf/Mistral-7B-Instruct-v0.3.i1-IQ3_XS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ3_XS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/gguf/Mistral-7B-Instruct-v0.3.i1-IQ4_XS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ4_XS imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/gguf/Mistral-7B-Instruct-v0.3.i1-Q2_K.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF Q2_K imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/gguf/Mistral-7B-Instruct-v0.3.i1-Q3_K_M.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF Q3_K_M imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/gguf/Mistral-7B-Instruct-v0.3.i1-Q6_K.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF Q6_K imat"
+    },
+    {
+        "model_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/gguf/Mistral-7B-Instruct-v0.3.i1-IQ3_XXS.gguf",
+        "load_fn": "llamacpp",
+        "fwd_fn": "llamacpp",
+        "label": "GGUF IQ3_XXS imat"
+    }
+]
--- a/eval/spec/wiki2_llama3.json
+++ b/eval/spec/wiki2_llama3.json
@@ -0,0 +1,8 @@
+{
+    "tokenize_fn": "transformers",
+    "tokenizer_dir": "/mnt/str/eval_models/llama3.2-1b/hf/",
+    "dataset": "wiki2",
+    "eval_stride": 512,
+    "eval_len": 2048,
+    "max_rows": 20
+}
--- a/eval/spec/wiki2_llama3_large.json
+++ b/eval/spec/wiki2_llama3_large.json
@@ -0,0 +1,8 @@
+{
+    "tokenize_fn": "transformers",
+    "tokenizer_dir": "/mnt/str/eval_models/llama3.2-1b/hf/",
+    "dataset": "wiki2",
+    "eval_stride": 512,
+    "eval_len": 2048,
+    "max_rows": 100
+}
--- a/eval/spec/wiki2_mistral_large.json
+++ b/eval/spec/wiki2_mistral_large.json
@@ -0,0 +1,8 @@
+{
+    "tokenize_fn": "transformers",
+    "tokenizer_dir": "/mnt/str/eval_models/mistral-7b-instruct-v0.3/hf/",
+    "dataset": "wiki2",
+    "eval_stride": 512,
+    "eval_len": 2048,
+    "max_rows": 100
+}
--- a/examples/async_generator.py
+++ b/examples/async_generator.py
@@ -0,0 +1,89 @@
+import sys, os
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from exllamav3 import Model, Config, Cache, Tokenizer, AsyncGenerator, AsyncJob, Sampler
+import asyncio
+
+"""
+The async generator is a wrapper class that allows you to treat generator jobs as asynchronous iterators, while
+still letting concurrent jobs benefit from batching. Here is a simple example using asyncio.gather to launch a
+batch of async tasks at once.
+"""
+
+async def main():
+
+    # Load model etc.
+    config = Config.from_directory("/mnt/str/eval_models/llama3.1-8b-instruct/exl3/4.0bpw/")
+    model = Model.from_config(config)
+    cache = Cache(model, max_num_tokens = 32768)
+    model.load()
+    tokenizer = Tokenizer.from_config(config)
+
+    # Initialize the async generator with default settings
+    generator = AsyncGenerator(
+        model = model,
+        cache = cache,
+        tokenizer = tokenizer,
+    )
+
+    # Define a couple of prompts
+    prompts = [
+        "Once upon a time, there was",
+        "asyncio in Python is a great feature because",
+        "asyncio in Python is a pain to work with because",
+    ]
+
+    # Async task running async job in the async generator
+    async def run_job(prompt: str, marker: str):
+
+        # Create an asynchronous job. The job presents as an iterator which is transparently batched with other
+        # concurrent jobs for the same generator.
+        job = AsyncJob(
+            generator,
+            input_ids = tokenizer.encode(prompt, add_bos = False),
+            max_new_tokens = 200
+        )
+
+        # Iterate over the job. Each returned result is a dictionary containing an update on the status of the
+        # job and/or part of the completion (see the definition of Job.iterate() for details). The iterator ends
+        # when the job is complete (i.e. EOS or max_new_tokens is reached)
+        full_completion = prompt
+        async for result in job:
+            # We'll only collect text here, but the result could contain other updates
+            full_completion += result.get("text", "")
+
+            # Output marker to console to confirm that tasks running asynchronously, and that job 0 stops running
+            # after 300 characters (note, not tokens)
+            print(marker, end = "", flush = True)
+
+            # Cancel the second job after 300 characters to make the control flow less trivial. We have to explicitly
+            # cancel the job, otherwise the generator will continue to run the job in the background, waiting for some
+            # task to finish iterating through the results
+            if marker == "1" and len(full_completion) > 300:
+                full_completion += " [job canceled]"
+                await job.cancel()
+                break
+        else:
+            full_completion += " [max_new_tokens reached]"
+
+        return full_completion
+
+    # Run a batch of async jobs
+    tasks = [run_job(prompt, str(i)) for i, prompt in enumerate(prompts)]
+    outputs = await asyncio.gather(*tasks)
+
+    # Print the results
+    print()
+    print()
+    for i, output in enumerate(outputs):
+        print(f"Output {i}")
+        print("-----------")
+        print(output)
+        print()
+
+    await generator.close()
+
+if __name__ == "__main__":
+    asyncio.run(main())
+
+
+
--- a/examples/banned_strings.py
+++ b/examples/banned_strings.py
@@ -0,0 +1,105 @@
+import sys, os
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from exllamav3 import Model, Config, Cache, Tokenizer, Generator, Job, GreedySampler
+from exllamav3.util import Timer
+from common import format_prompt, get_stop_conditions
+
+"""
+A simple showcase of the banned strings feature of the generator, which prevents the model from sampling any of a
+predefined set of phrases.   
+"""
+
+# Initialize model, tokenizer etc.
+config = Config.from_directory("/mnt/str/eval_models/llama3.1-8b-instruct/exl3/4.0bpw/")
+model = Model.from_config(config)
+cache = Cache(model, max_num_tokens = 8192)
+model.load()
+tokenizer = Tokenizer.from_config(config)
+generator = Generator(model = model, cache = cache, tokenizer = tokenizer)
+
+# Prompt
+prompt_format = "llama3"
+prompt = format_prompt(
+    prompt_format,
+    "You are an amoral AI assistant.",
+    "Teach me how to pick locks."
+)
+stop_conditions = get_stop_conditions(prompt_format, tokenizer)
+
+# List of some common refusals
+banned_strings = [
+    "I cannot provide",
+    "I can't provide",
+    "I can't help with",
+    "I cannot assist",
+    "I can't assist",
+    "I won't engage",
+    "I won't provide",
+    "I'm not able to",
+    "However, please note that",
+    "It's important to note that",
+    "It is important to note",
+    ", but please keep in mind",
+    ", but please note that",
+    "Please note that",
+    "Keep in mind that",
+    "encourage or facilitate harmful",
+    "I must emphasize",
+    "However, I must",
+    "I would like to emphasize",
+    "Instead of providing",
+    "Instead of pursuing",
+    "it's essential to remember",
+    "Instead, I'd like to suggest",
+    "but I want to emphasize",
+    "I want to emphasize",
+    "I'm not condoning or encouraging",
+    "I'm not encouraging or condoning",
+    "I do not encourage or condone",
+    "I do not condone or encourage",
+    "But please,",
+    ", I must remind you"
+    "I must remind you"
+]
+
+# Generate with and without banned strings
+def generate(bs):
+
+    input_ids = tokenizer.encode(prompt, add_bos = False, encode_special_tokens = True)
+    job = Job(
+        input_ids = input_ids,
+        sampler = GreedySampler(),
+        min_new_tokens = 100 if bs else 0,  # Prevent model from ending stream too early
+        max_new_tokens = 300,
+        banned_strings = bs,
+        stop_conditions = stop_conditions
+    )
+    generator.enqueue(job)
+
+    # Stream output to console. Banned strings will not be included in the output stream, but every time a string
+    # is suppressed the offending text is returned in the results packet, so we can illustrate what's going on
+    col_banned = "\u001b[9m\u001b[31;1m"  # Magenta, strikethrough
+    col_default = "\u001b[0m"
+
+    while generator.num_remaining_jobs():
+        results = generator.iterate()
+        for result in results:
+            if "text" in result:
+                print(result["text"], end = "", flush = True)
+            if "suppressed_text" in result:
+                print(col_banned + result["suppressed_text"] + col_default, end = "", flush = True)
+    print()
+
+print("--------------------------------------------------------------------------------------")
+print("Without banned strings")
+print("--------------------------------------------------------------------------------------")
+
+generate(bs = None)
+print()
+
+print("--------------------------------------------------------------------------------------")
+print("With banned strings")
+print("--------------------------------------------------------------------------------------")
+
+generate(bs = banned_strings)
+print()
--- a/examples/chat.py
+++ b/examples/chat.py
@@ -0,0 +1,112 @@
+import sys, os
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+import argparse
+from exllamav3 import Generator, Job, model_init
+from chat_templates import *
+import torch
+
+# ANSI color codes
+col_default = "\u001b[0m"
+col_user = "\u001b[33;1m"  # Yellow
+col_bot = "\u001b[34;1m"  # Blue
+col_error = "\u001b[31;1m"  # Magenta
+col_sysprompt = "\u001b[37;1m"  # Grey
+
+@torch.inference_mode()
+def main(args):
+
+    # Prompt format
+    if args.modes:
+        print("Available modes:")
+        for k, v in prompt_formats.items():
+            print(f" - {k:16} {v.description}")
+        return
+
+    user_name = args.user_name
+    bot_name = args.bot_name
+    prompt_format = prompt_formats[args.mode](user_name, bot_name)
+    system_prompt = prompt_format.default_system_prompt() if not args.system_prompt else args.system_prompt
+    add_bos = prompt_format.add_bos()
+    max_response_tokens = args.max_response_tokens
+
+    # Load model
+    model, config, cache, tokenizer = model_init.init(args)
+    context_length = cache.max_num_tokens
+
+    # Generator
+    generator = Generator(
+        model = model,
+        cache = cache,
+        tokenizer = tokenizer,
+    )
+    stop_conditions = prompt_format.stop_conditions(tokenizer)
+
+    # Main loop
+    print("\n" + col_sysprompt + system_prompt.strip() + col_default)
+    context = []
+
+    while True:
+
+        # Get user prompt and add to context
+        print("\n" + col_user + user_name + ": " + col_default, end = '', flush = True)
+        if args.mli:
+            user_prompt = sys.stdin.read().rstrip()
+        else:
+            user_prompt = input().strip()
+        context.append((user_prompt, None))
+
+        # Tokenize context and trim from head if too long
+        def get_input_ids():
+            frm_context = prompt_format.format(system_prompt, context)
+            ids_ = tokenizer.encode(frm_context, add_bos = add_bos, encode_special_tokens = True)
+            exp_len_ = ids_.shape[-1] + max_response_tokens + 1
+            return ids_, exp_len_
+
+        ids, exp_len = get_input_ids()
+        if exp_len > context_length:
+            while exp_len > context_length - 2 * max_response_tokens:
+                context = context[1:]
+                ids, exp_len = get_input_ids()
+
+        # Inference
+        print("\n" + col_bot + bot_name + ": " + col_default, end = "")
+        job = Job(
+            input_ids = ids,
+            max_new_tokens =  max_response_tokens,
+            stop_conditions = stop_conditions
+        )
+        generator.enqueue(job)
+
+        # Stream response
+        response = ""
+        while generator.num_remaining_jobs():
+            for r in generator.iterate():
+                chunk = r.get("text", "")
+                if not response and chunk.startswith(" "):
+                    print(chunk[1:], end = "", flush = True)
+                else:
+                    print(chunk, end = "", flush = True)
+                response += chunk
+                if r["eos"] and r["eos_reason"] == "max_new_tokens":
+                    print("\n" + col_error + f" !! Response exceeded {max_response_tokens} tokens and was cut short." + col_default)
+        if not response.endswith("\n"):
+            print()
+
+        # Add response to context
+        context[-1] = (user_prompt, response.strip())
+
+
+if __name__ == "__main__":
+    parser = argparse.ArgumentParser()
+    model_init.add_args(parser, cache = True)
+    parser.add_argument("-mode", "--mode", type = str, help = "Prompt mode", required = True)
+    parser.add_argument("-modes", "--modes", action = "store_true", help = "List available prompt modes and exit")
+    parser.add_argument("-un", "--user_name", type = str, default = "User", help = "User name (raw mode only)")
+    parser.add_argument("-bn", "--bot_name", type = str, default = "Assistant", help = "Bot name (raw mode only)")
+    parser.add_argument("-mli", "--mli", action = "store_true", help = "Enable multi line input")
+    parser.add_argument("-sp", "--system_prompt", type = str, help = "Use custom system prompt")
+    parser.add_argument("-maxr", "--max_response_tokens", type = int, default = 1000, help = "Max tokens per response, default = 1000")
+    # TODO: Sampling options
+    _args = parser.parse_args()
+    main(_args)
--- a/examples/chat_templates.py
+++ b/examples/chat_templates.py
@@ -0,0 +1,209 @@
+
+class PromptFormat:
+    def __init__(self, user_name, bot_name):
+        self.user_name = user_name
+        self.bot_name = bot_name
+    def default_system_prompt(self):
+        raise NotImplementedError()
+    def format(self, system_prompt, messages):
+        raise NotImplementedError()
+    def add_bos(self):
+        raise NotImplementedError()
+
+
+class PromptFormat_raw(PromptFormat):
+    description = "Model-agnostic mode simulating a raw chatlog"
+
+    def __init__(self, *args):
+        super().__init__(*args)
+
+    def default_system_prompt(self):
+        return (
+            f"This is a conversation between a helpful AI assistant " +
+            (f"named {self.bot_name} " if self.bot_name != "Assistant" else "") +
+            (f"and a user named {self.user_name}." if self.user_name != "User" else """and a user.""")
+        )
+
+    def format(self, system_prompt, messages):
+        context = system_prompt + "\n"
+        for (u, a) in messages:
+            context += f"{self.user_name}: {u}\n"
+            context += f"{self.bot_name}:"
+            if a is not None:
+                context += f"{a}\n"
+        return context
+
+    def add_bos(self):
+        return True
+
+    def stop_conditions(self, tokenizer):
+        return [
+            self.user_name + ":",
+            self.user_name[0:1] + ":",
+            self.user_name.upper() + ":",
+            self.user_name.lower() + ":",
+            tokenizer.eos_token_id
+        ]
+
+
+class PromptFormat_llama3(PromptFormat):
+    description = "Llama3-instruct models"
+
+    def __init__(self, *args):
+        super().__init__(*args)
+
+    def default_system_prompt(self):
+        return (
+            """Assist users with tasks and answer questions to the best of your knowledge. Provide helpful and informative """
+            """responses. Be conversational and engaging. If you are unsure or lack knowledge on a topic, admit it and try """
+            """to find the answer or suggest where to find it. Keep responses concise and relevant. Follow ethical """
+            """guidelines and promote a safe and respectful interaction."""
+        )
+
+    def format(self, system_prompt, messages):
+        context = f"<|start_header_id|>system<|end_header_id|>\n\n{system_prompt}<|eot_id|>"
+        for (u, a) in messages:
+            context += f"<|start_header_id|>user<|end_header_id|>\n\n{u}<|eot_id|>"
+            context += f"<|start_header_id|>assistant<|end_header_id|>\n\n"
+            if a is not None: context += f"{a}<|eot_id|>"
+        return context
+
+    def add_bos(self):
+        return True
+
+    def stop_conditions(self, tokenizer):
+        return [
+            tokenizer.eos_token_id,
+            tokenizer.single_id("<|eot_id|>"),
+            tokenizer.single_id("<|start_header_id|>")
+        ]
+
+
+class PromptFormat_chatml(PromptFormat):
+    description = "ChatML format, as used by e.g. Qwen"
+
+    def __init__(self, *args):
+        super().__init__(*args)
+
+    def default_system_prompt(self):
+        return (
+            f"You are {self.bot_name}, a large language model. Answer as concisely as possible."
+        )
+
+    def format(self, system_prompt, messages):
+        context = f"<|im_start|>system\n{system_prompt}<|im_end|>\n"
+        for (u, a) in messages:
+            context += f"<|im_start|>user\n{u}<|im_end|>\n"
+            context += f"<|im_start|>assistant\n"
+            if a is not None: context += f"{a}<|im_end|>\n"
+        return context
+
+    def add_bos(self):
+        return True
+
+    def stop_conditions(self, tokenizer):
+        return [
+            tokenizer.eos_token_id,
+            tokenizer.single_id("<|im_end|>"),
+            """<|im_end|>"""
+        ]
+
+
+class PromptFormat_phi(PromptFormat):
+    description = "Phi3/Phi4 instruct models"
+
+    def __init__(self, *args):
+        super().__init__(*args)
+
+    def default_system_prompt(self):
+        return (
+            f"You are a helpful AI assistant."
+        )
+
+    def format(self, system_prompt, messages):
+        context = f"<|system|>\n{system_prompt}<|end|>\n"
+        for (u, a) in messages:
+            context += f"<|user|>\n{u}<|end|>\n"
+            context += f"<|assistant|>\n"
+            if a is not None: context += f"{a}<|end|>\n"
+        return context
+
+    def add_bos(self):
+        return True
+
+    def stop_conditions(self, tokenizer):
+        return [
+            tokenizer.eos_token_id,
+            tokenizer.single_id("<|end|>"),
+        ]
+
+
+class PromptFormat_mistral(PromptFormat):
+    description = "Mistral-instruct models (v3)"
+
+    def __init__(self, *args):
+        super().__init__(*args)
+
+    def default_system_prompt(self):
+        return (
+            """You are a helpful AI assistant."""
+        )
+
+    def format(self, system_prompt, messages):
+        context = ""
+        first = True
+        for (u, a) in messages:
+            if first:
+                context += f"[INST] {system_prompt}\n\n{u}[/INST]"
+                first = False
+            else:
+                context += f"[INST] {u}[/INST]"
+            if a is not None: context += f" {a}</s>"
+        return context
+
+    def add_bos(self):
+        return True
+
+    def stop_conditions(self, tokenizer):
+        return [
+            tokenizer.eos_token_id
+        ]
+
+
+class PromptFormat_gemma(PromptFormat):
+    description = "Gemma"
+
+    def __init__(self, *args):
+        super().__init__(*args)
+
+    def default_system_prompt(self):
+        return ""
+
+    def format(self, system_prompt, messages):
+        context = ""
+        for (u, a) in messages:
+            context += f"<start_of_turn>user\n"
+            context += f"{u}<end_of_turn>\n"
+            context += f"<start_of_turn>model\n"
+            if a is not None: context += f"{a}<end_of_turn>\n"
+        return context
+
+    def add_bos(self):
+        return True
+
+    def stop_conditions(self, tokenizer):
+        return [
+            tokenizer.eos_token_id,
+            tokenizer.single_id("<end_of_turn>"),
+            tokenizer.single_id("<start_of_turn>"),
+        ]
+
+
+prompt_formats = {
+    "raw": PromptFormat_raw,
+    "llama3": PromptFormat_llama3,
+    "chatml": PromptFormat_chatml,
+    "phi": PromptFormat_phi,
+    "mistral": PromptFormat_mistral,
+    "gemma": PromptFormat_gemma,
+}
--- a/examples/common.py
+++ b/examples/common.py
@@ -0,0 +1,67 @@
+
+"""
+Quick and dirty and probably not very accurate prompt templates for a couple of models
+"""
+
+def format_prompt(prompt_format, sp, p):
+
+    match prompt_format:
+
+        case "llama":
+            return f"<s>[INST] <<SYS>>\n{sp}\n<</SYS>>\n\n{p} [/INST]"
+
+        case "llama3":
+            return (
+            f"<|begin_of_text|>"
+            f"<|start_header_id|>system<|end_header_id|>\n\n"
+            f"{sp}<|eot_id|>"
+            f"<|start_header_id|>user<|end_header_id|>\n\n"
+            f"{p}<|eot_id|>"
+            f"<|start_header_id|>assistant<|end_header_id|>\n\n"
+        )
+
+        case "mistral":
+            return f"<s>[INST] {sp}\n\n n{p}[/INST]"
+
+        case "granite":
+            return (
+                f"System:\n"
+                f"{sp}\n\n"
+                f"Question:\n"
+                f"{p}\n\n"
+                f"Answer:\n"
+            )
+
+        case "chatml":
+            return (
+                f"<|im_start|>system\n"
+                f"{sp}<|im_end|>\n"
+                f"<|im_start|>user\n"
+                f"{p}<|im_end|>\n"
+                f"<|im_start|>assistant\n"
+            )
+
+        case "gemma":
+            return (
+                f"<bos><start_of_turn>user\n"
+                f"{p}<end_of_turn>\n"
+                f"<start_of_turn>model\n"
+            )
+
+        case _:
+            raise ValueError("Unknown prompt format")
+
+
+def get_stop_conditions(prompt_format, tokenizer):
+
+    match prompt_format:
+        case "llama":
+            return [tokenizer.eos_token_id]
+        case "llama3":
+            return [tokenizer.single_id("<|eot_id|>")]
+        case "granite":
+            return [tokenizer.eos_token_id, "\n\nQuestion:"]
+        case "gemma":
+            return [tokenizer.eos_token_id, "<end_of_turn>"]
+        case "chatml":
+            return [tokenizer.eos_token_id, "<|im_end|>"]
--- a/examples/dynamic_gen.py
+++ b/examples/dynamic_gen.py
@@ -0,0 +1,248 @@
+import sys, os
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from exllamav3 import Model, Config, Cache, Tokenizer, Generator, Job, Sampler
+from exllamav3.util import Timer
+from blessed import Terminal
+from common import format_prompt, get_stop_conditions
+import pprint
+
+"""
+This is a demo and small showcase some of the features of the dynamic batching generator
+
+Display modes for this demo:
+1: One line per job, updated continuously
+2: Print completions as jobs finish
+3: Step over output iteration by iteration
+4: Space heater mode (no output)
+"""
+display_mode = 1
+
+# Where to find our model
+model_dir = "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/4.0bpw/"
+
+# Total number of tokens to allocate space for in the cache.
+total_context = 32768
+
+# Max number of batches to run at once, assuming the sequences will fit within total_context.
+max_batch_size = 16
+
+# Max chunk size. Determines the size of prefill operations. Can be reduced to reduce pauses whenever a
+# new job is started, but at the expense of overall prompt ingestion speed.
+max_chunk_size = 2048
+
+# Max new tokens per completion. For this example applies to all jobs.
+max_new_tokens = 500
+
+# Some prompts to feed the generator
+prompt_format = "llama3"
+system_prompt = "You are an AI assistant"
+prompts = [
+    "What is 2+2 and why?",
+    "Can you guess the next number in this sequence: " + ", ".join(str(n) for n in range(500)),
+    "Can you guess the next number in this sequence: " + ", ".join(str(n) for n in range(400)),
+    "Can you guess the next number in this sequence: " + ", ".join(str(n) for n in range(200)),
+    "Can you write a C++ quicksort implementation pretty please?",
+    "Hello!",
+    "Hi there!",
+    "What's the difference smoke and vapor?",
+    "What seems out of place in this sequence: " + ", ".join(str(n if n != 123 else 69) for n in range(200)),
+    "What seems out of place in this sequence: " + ", ".join(str(n if n != 42 else 111) for n in range(200)),
+    "What seems out of place in this sequence: " + ", ".join(str(n if n != 42 else 111) for n in range(200)),
+    "What seems out of place in this sequence: " + ", ".join(str(n if n != 42 else 111) for n in range(200)),
+    "What seems out of place in this sequence: " + ", ".join(str(n if n != 42 else 111) for n in range(200)),
+    "Please guess the next 20 numbers in this sequence: " + ", ".join(str(n) for n in range(700)),
+    "Write a short essay about cell membranes.",
+    "What's up?",
+    "How do I open a can of beans?",
+    "How do I open a can of soup?",
+    "How do I open a can of strawberry jam?",
+    "How do I open a can of raspberry jam?",
+    "What's the tallest building in Paris?",
+    "What's the most populous nation on Earth?",
+    "What's the most populous nation on Mars?",
+    "What do the Mole People actually want and how can we best appease them?",
+    "Why is the sky blue?",
+    "Where is Waldo?",
+    "Who is Waldo?",
+    "Why is Waldo?",
+    "Is it legal to base jump off the Eiffel Tower?",
+    "Is it legal to base jump into a volcano?",
+    "Why are cats better than dogs?",
+    "Why is the Hulk so angry all the time?",
+    "How do I build a time machine?",
+    "What seems out of place in this sequence: " + ", ".join(str(n if n != 123 else 69) for n in range(200)),
+    "Is it legal to grow your own catnip?",
+    "What seems out of place in this sequence: " + ", ".join(str(n if n != 160 else 420) for n in range(400)),
+    "What seems out of place in this sequence: " + ", ".join(str(n if n != 161 else 421) for n in range(400)),
+    "What's inside a black hole?",
+    "What do the numbers 2, 4, 8, 16, 32 and 64 have in common?",
+    "What do the numbers 2, 3, 5, 7, 11 and 13 have in common?",
+    "Is there life on Mars?",
+    "Hello!",
+    "Hi!",
+    "Boop!",
+    "Why are cats better than dogs?",
+    "Why are cats better than dogs?",
+    "Why are cats better than dogs?",
+    "Write a parable about why cats are better than dogs.",
+]
+
+term = Terminal()
+
+def main():
+
+    # Load the model config
+    config = Config.from_directory("/mnt/str/eval_models/llama3.1-8b-instruct/exl3/2.0bpw/")
+
+    # Create the model from the config
+    model = Model.from_config(config)
+
+    # Create the cache before loading the model, so cache tensors are accounted for in the split
+    cache = Cache(model, max_num_tokens = total_context)
+
+    # Finally load the model. The default mode is autosplit.
+    model.load()
+
+    # Load the tokenizer
+    print("Loading tokenizer...")
+    tokenizer = Tokenizer.from_config(config)
+
+    # Initialize the generator
+    generator = Generator(
+        model = model,
+        cache = cache,
+        tokenizer = tokenizer,
+        max_batch_size = max_batch_size,
+        max_chunk_size = max_chunk_size,
+    )
+
+    # Create jobs
+    jobs = []
+    for prompt in prompts:
+        fprompt =  format_prompt(prompt_format, system_prompt, prompt)
+        input_ids = tokenizer.encode(fprompt, encode_special_tokens = True)
+        job = Job(
+            input_ids = input_ids,
+            max_new_tokens = max_new_tokens,
+            stop_conditions = get_stop_conditions(prompt_format, tokenizer)
+        )
+        jobs.append(job)
+
+    # Enqueue all the jobs at once
+    generator.enqueue(jobs)
+
+    # Go
+    match display_mode:
+
+        # Mode 1
+        case 1:
+            class JobStatusDisplay:
+                def __init__(self, job, console_line):
+                    self.console_line = console_line
+                    self.job = job
+                    self.prefill = 0
+                    self.max_prefill = 0
+                    self.collected_output = ""
+                    self.tokens = 0
+                    self.spaces = " " * 80
+                    text = term.darkgray(f"{self.console_line:3}:")
+                    text += term.blue("enqueued")
+                    print(term.move_xy(0, self.console_line) + text)
+
+                def update(self, r):
+                    stage = r["stage"]
+                    stage = r.get("eos_reason", stage)
+                    self.collected_output += r.get("text", "").replace("\n", "\\n")
+                    token_ids = r.get("token_ids", None)
+                    if token_ids is not None: self.tokens += token_ids.shape[-1]
+                    self.prefill = r.get("curr_progress", self.prefill)
+                    self.max_prefill = r.get("max_progress", self.max_prefill)
+                    text = term.darkgray(f"{self.console_line:3}:")
+                    text += term.blue(f"{stage:16}")
+                    text += "prefill [ " + term.yellow(f"{self.prefill: 5} / {self.max_prefill: 5}") + " ]"
+                    text += "   "
+                    text += term.green(f"{self.tokens: 5} t")
+                    text += term.darkgray(" -> ")
+                    text += (self.spaces + self.collected_output)[-80:].replace("\t", " ")
+                    if "accepted_draft_tokens" in r:
+                        acc = r["accepted_draft_tokens"]
+                        rej = r["rejected_draft_tokens"]
+                        eff = acc / (acc + rej) * 100.0
+                        text += term.bright_magenta(f"   SD eff.: {eff:6.2f}%")
+                    print(term.move_xy(0, self.console_line) + text)
+
+            print(term.enter_fullscreen())
+            displays = { job: JobStatusDisplay(job, line) for line, job in enumerate(jobs) }
+            while generator.num_remaining_jobs():
+                results = generator.iterate()
+                for r in results:
+                    job = r["job"]
+                    displays[job].update(r)
+            print(term.move_xy(0, len(displays) + 1) + "Press any key to continue...")
+            with term.cbreak():
+                term.inkey()
+
+        # Mode 2
+        case 2:
+            total_tokens = 0
+            total_time = 0
+            while generator.num_remaining_jobs():
+                with Timer() as t:
+                    results = generator.iterate()
+                total_time += t.interval
+                for r in results:
+                    if r["stage"] == "streaming" and not r["eos"]:
+                        total_tokens += r["token_ids"].shape[-1]
+                for r in results:
+                    if r["stage"] == "streaming" and r["eos"]:
+                        job = r["job"]
+                        in_prompt = \
+                        tokenizer.decode(job.sequences[0].input_ids.torch(), decode_special_tokens = True)[0]
+                        print("\n")
+                        print(term.darkgray("Input: "))
+                        print(term.yellow(in_prompt))
+                        print()
+                        print(term.darkgray("Output:"))
+                        print(r["full_completion"])
+                        print()
+                        print(term.darkgray("New tokens:        ") + term.green(f"{r['new_tokens']:9} t"))
+                        print(term.darkgray("Cached tokens:     ") + term.green(
+                            f"{r['cached_tokens']:7} t / {r['prompt_tokens']:7} t"))
+                        print(term.darkgray("Enqueued:          ") + term.blue(f"{r['time_enqueued']:9.2f} s"))
+                        print(term.darkgray("Prefill:           ") + term.blue(f"{r['time_prefill']:9.2f} s"))
+                        print(term.darkgray("Generation:        ") + term.blue(f"{r['new_tokens']:9.2f} s"))
+                        speed_input = r['prompt_tokens'] / (r['time_prefill'] + 1e-10)
+                        speed_output = r['new_tokens'] / (r['time_generate'] + 1e-10)
+                        speed_total = total_tokens / total_time
+                        print(term.darkgray("Job input          ") + term.cyan(f"{speed_input:9.2f} t/s"))
+                        print(term.darkgray("Job output         ") + term.cyan(f"{speed_output:9.2f} t/s"))
+                        print(term.darkgray("Overall output     ") + term.cyan(f"{speed_total:9.2f} t/s"))
+                        if "accepted_draft_tokens" in r:
+                            acc = r["accepted_draft_tokens"]
+                            rej = r["rejected_draft_tokens"]
+                            eff = acc / (acc + rej) * 100.0
+                            print(term.darkgray("SD efficiency:     ") + term.bright_magenta(f"{eff:9.2f}%"))
+
+        # Mode 3
+        case 3:
+            while generator.num_remaining_jobs():
+                results = generator.iterate()
+                print()
+                pprint.pprint(results, indent = 4)
+                print()
+                print("Press any key to continue...")
+                with term.cbreak():
+                    term.inkey()
+
+        case 4:
+            while generator.num_remaining_jobs():
+                generator.iterate()
+
+
+if __name__ == "__main__":
+    try:
+        main()
+    finally:
+        pass
+        if display_mode == 1:
+            print(term.exit_fullscreen())
--- a/examples/generation_loop.py
+++ b/examples/generation_loop.py
@@ -0,0 +1,91 @@
+import sys, os
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+
+from exllamav3 import Config, Model, Cache, Tokenizer, DefaultSampler
+from exllamav3.util import Timer
+from common import format_prompt, get_stop_conditions
+import torch
+
+"""
+This script demonstrates a minimal, cached generation pipeline, starting with tokenization of a prompt, prefill
+and then token-by-token sampling from logits produced by iterative forward passes through the model. For most
+applications the built-in generator offers more flexibility, though. 
+"""
+
+# Load model
+config = Config.from_directory("/mnt/str/eval_models/llama3.1-8b-instruct/exl3/2.0bpw/")
+model = Model.from_config(config)
+cache = Cache(model, max_num_tokens = 2048)
+model.load()
+
+# Load tokenizer
+tokenizer = Tokenizer.from_config(config)
+
+# Prepare inputs
+prompt_format = "llama3"
+prompt_text = format_prompt(
+    prompt_format,
+    "You are a super helpful language model.",
+    "List five ways in which cats are superior to dogs."
+)
+context_ids = tokenizer.encode(prompt_text, encode_special_tokens = True)
+
+# Sampling and stop conditions
+sampler = DefaultSampler()
+stop_conditions = get_stop_conditions(prompt_format, tokenizer)
+
+# Get model vocabulary as a list of strings, for streaming the completion
+vocab = tokenizer.get_id_to_piece_list()
+
+# Prefill the prompt, up to but not including the last token, which will be the first token forwarded in the
+# generation loop. Treat the cache as a rectangular batch
+model.prefill(
+    input_ids = context_ids[:, :-1],
+    params = {
+        "attn_mode": "flash_attn",
+        "cache": cache,
+        "past_len": 0,
+        "batch_shape": (1, 2048),
+    }
+)
+
+# Generation loop
+max_new_tokens = 500
+generated_tokens = 0
+response = ""
+
+torch.cuda.synchronize()
+with Timer() as t:
+    while generated_tokens < max_new_tokens:
+
+        # Get logits for current position
+        logits = model.forward(
+            input_ids = context_ids[:, -1:],
+            params = {
+                "attn_mode": "flash_attn",
+                "cache": cache,
+                "past_len": context_ids.shape[-1] - 1,
+                "batch_shape": (1, 2048),
+            }
+        )
+
+        # Sample from logits
+        sample = sampler.forward(logits, tokenizer = tokenizer)
+        token_id = sample.item()
+
+        # Detect end of stream
+        if token_id in stop_conditions:
+            break
+
+        # Append sampled token to context
+        context_ids = torch.cat((context_ids, sample), dim = -1)
+        token = vocab[token_id]
+        response += token
+        generated_tokens += 1
+
+        # Stream to the console
+        print(token, end = "", flush = True)
+
+print()
+print("---")
+print(f"{generated_tokens} tokens at {generated_tokens/t.interval:.3f} tokens/second")
--- a/examples/generator.py
+++ b/examples/generator.py
@@ -0,0 +1,186 @@
+import sys, os
+sys.path.append(os.path.dirname(os.path.dirname(os.path.abspath(__file__))))
+from exllamav3 import Config, Model, Cache, Tokenizer, Generator, Job, TopPSampler
+from common import format_prompt, get_stop_conditions
+
+"""
+A couple of examples showing uses of the generator
+"""
+
+prompt_format = "llama3"  # see common.py
+model_dir = "/mnt/str/eval_models/llama3.1-8b-instruct/exl3/4.0bpw/"
+cache_size = 16384
+
+system_prompt = "You are a very nice language model."
+
+instructions = [
+    "Write a short story beginning with the words 'Once in a while, when you least expect it'.",
+    "Why are cats so awesome?",
+    "Who was the tallest president of the United States?",
+    "Why are there so many different kinds of screws?",
+    "oinnvdoehwemnascnawwui8dh2",
+    "Write a haiku about catnip."
+]
+
+# Generate a single completion to a single prompt
+def generate_single(generator, tokenizer):
+    instruction = instructions[0]
+    print("------------------")
+    print("Prompt: " + instruction)
+    print()
+    response = generator.generate(
+        prompt = format_prompt(prompt_format, system_prompt, instruction),
+        stop_conditions = get_stop_conditions(prompt_format, tokenizer),
+        max_new_tokens = 500,
+        completion_only = True,
+        add_bos = True
+    )
+    print("Response: " + response)
+    print()
+
+
+# Generate multiple batched completions
+def generate_batched(generator, tokenizer):
+    print("------------------")
+    responses = generator.generate(
+        prompt = [format_prompt(prompt_format, system_prompt, instruction) for instruction in instructions],
+        stop_conditions = get_stop_conditions(prompt_format, tokenizer),
+        max_new_tokens = 100,
+        completion_only = True,
+        add_bos = True
+    )
+    for idx, response in enumerate(responses):
+        print(f"#{idx + 1}: {response}")
+        print("------------------")
+
+
+# Create a job and generate a stream of tokens
+def generate_streaming(generator, tokenizer):
+    instruction = instructions[0]
+    print("------------------")
+    print("Prompt: " + instruction)
+    print()
+    print("Response: ", end = "", flush = True)
+
+    # Create the job and enqueue it
+    formatted_prompt = format_prompt(prompt_format, system_prompt, instruction)
+    job = Job(
+        input_ids = tokenizer.encode(formatted_prompt, add_bos = True),
+        max_new_tokens = 400,
+        stop_conditions = get_stop_conditions(prompt_format, tokenizer),
+    )
+    generator.enqueue(job)
+
+    # Keep iterating until the generator has no more jobs
+    while generator.num_remaining_jobs():
+        results = generator.iterate()
+
+        # Each iteration returns a list of results, each of which may contain output tokens for a running job. We
+        # only care about the "text" field here.
+        for result in results:
+            text = result.get("text", "")
+            print(text, end = "", flush = True)
+
+    print()
+
+
+# Create a batch of jobs and stream the results
+def generate_streaming_batched(generator, tokenizer):
+
+    # Some buffers for collecting results
+    responses = [""] * len(instructions)
+
+    for idx, instruction in enumerate(instructions):
+
+        # Only print the second job to the console
+        if idx == 1:
+            print("------------------")
+            print("Prompt: " + instruction)
+            print()
+            print("Streamed response: ", end = "", flush = True)
+
+        # Create each job and enqueue it. Since one iteration of the generator can return multiple results, adding
+        # an identifier argument lets us track which sequence each chunk of output pertains to. The identifier can
+        # be any object, but a simple index will work here
+        formatted_prompt = format_prompt(prompt_format, system_prompt, instruction)
+        job = Job(
+            input_ids = tokenizer.encode(formatted_prompt, add_bos = True),
+            max_new_tokens = 400,
+            stop_conditions = get_stop_conditions(prompt_format, tokenizer),
+            identifier = idx,
+        )
+        generator.enqueue(job)
+
+    # Keep iterating until the generator has no more jobs
+    while generator.num_remaining_jobs():
+        results = generator.iterate()
+
+        for result in results:
+            text = result.get("text", "")
+            idx = result["identifier"]
+
+            # If this result is from the first job, stream to the console
+            if idx == 1:
+                print(text, end = "", flush = True)
+
+            # Collect results
+            responses[idx] += text
+
+    print()
+    print("--------------")
+
+    # Finally print all the collected results
+    for idx, response in enumerate(responses):
+        print(f"#{idx + 1}: {response}")
+        print("------------------")
+
+
+# Generate a series of completions with increasing temperature
+def generate_temperature(generator, tokenizer):
+    instruction = instructions[5]
+    print("------------------")
+    print("Prompt: " + instruction)
+    print()
+    temperature = 0.0
+    while temperature <= 3.01:
+        print(f"Temperature = {temperature:.2f}: ", end = "", flush = True)
+        response = generator.generate(
+            prompt = format_prompt(prompt_format, system_prompt, instruction),
+            stop_conditions = get_stop_conditions(prompt_format, tokenizer),
+            sampler = TopPSampler(temperature = temperature, top_p = 0.95, temperature_last = True),
+            max_new_tokens = 100,
+            completion_only = True,
+            add_bos = True
+        )
+        print(response)
+        print()
+        temperature += 0.25
+
+
+def main():
+
+    # Load a model with cache
+    config = Config.from_directory(model_dir)
+    model = Model.from_config(config)
+    cache = Cache(model, max_num_tokens = cache_size)
+    model.load(progressbar = True)
+    tokenizer = Tokenizer.from_config(config)
+
+    # Create generator
+    generator = Generator(
+        model = model,
+        cache = cache,
+        tokenizer = tokenizer,
+    )
+
+    # Do some things
+    generate_single(generator, tokenizer)
+    generate_batched(generator, tokenizer)
+    generate_streaming(generator, tokenizer)
+    generate_streaming_batched(generator, tokenizer)
+    generate_temperature(generator, tokenizer)
+
+
+if __name__ == "__main__":
+    main()
+
--- a/exllamav3/init.py
+++ b/exllamav3/init.py
@@ -0,0 +1,6 @@
+from .models.config import Config
+from .models.model import Model
+from .tokenizer import Tokenizer
+from .cache import Cache, CacheLayer_fp16
+from .generator import Generator, Job, AsyncGenerator, AsyncJob
+from .generator.sampler import *
--- a/exllamav3/cache/init.py
+++ b/exllamav3/cache/init.py
@@ -0,0 +1,2 @@
+from .cache import Cache, CacheLayer
+from .fp16 import CacheLayer_fp16
--- a/exllamav3/cache/cache.py
+++ b/exllamav3/cache/cache.py
@@ -0,0 +1,119 @@
+from __future__ import annotations
+from abc import ABC, abstractmethod
+from typing import Type
+import torch
+import torch.nn.functional as F
+from torch import nn
+from ..models import Model, Config
+
+class CacheLayer(ABC):
+
+    def __init__(
+        self,
+        config: Config,
+        max_num_tokens: int,
+    ):
+        self.config = config
+        self.max_num_tokens = max_num_tokens
+
+    @abstractmethod
+    def alloc(self, device: torch.device):
+        pass
+
+    @abstractmethod
+    def free(self):
+        pass
+
+    @abstractmethod
+    def get_kv(self):
+        pass
+
+    @abstractmethod
+    def copy_page(self, source: CacheLayer, from_page: int, to_page: int, num_tokens: int):
+        pass
+
+
+class Cache:
+
+    def __init__(
+        self,
+        model: Model,
+        max_num_tokens: int,
+        layer_type: Type[CacheLayer] | None = None,
+    ):
+        """
+        Create cache for model
+
+        :param model:
+            Model for which to create the cache. Once created, the cache is tied to the model. Loading the model
+            will create cache tensors and unloading the model will destroy them. To delete the cache itself without
+            deleting the reference to the model, use detach_from_model
+
+        :param layer_type:
+            Cache layer class, one of CacheLayer_fp16, CacheLayer_q4, CacheLayer_q6 etc.
+
+        :param max_num_tokens:
+            Max number of total tokens in the cache. Must be a multiple of the page size (256). For use with the
+            dynamic generator, this is the total number of tokens that can be allocated across concurrent jobs. For
+            batched inference, seq_len * batch_size <= max_num_tokens
+        """
+        self.model = model
+        self.config = model.config
+        self.max_num_tokens = max_num_tokens
+
+        from .fp16 import CacheLayer_fp16
+        self.layer_type = layer_type or CacheLayer_fp16
+
+        self.num_layers = self.config.num_hidden_layers
+        self.layers = [
+            self.layer_type(
+                self.config,
+                self.max_num_tokens,
+            ) for _ in range(self.num_layers)
+        ]
+        self.attach_to_model()
+
+
+    def attach_to_model(self, model: Model | None = None):
+        """
+        Attach cache to model. Registering the cache with the model (done automatically by the Cache constructor)
+        is necessary in order to tie loading of the model to allocation of cache tensors. Multiple caches can be
+        attached to the same model.
+        """
+        if model is None:
+            model = self.model
+        assert model.config.num_hidden_layers == self.num_layers, \
+            f"Cannot attach cache with {self.num_layers} layers to model with {model.config.num_hidden_layers} layers."
+        for layer, module in zip(self.layers, (m for m in model if m.caps.get("kv_cache"))):
+            assert layer not in module.cache_layers, \
+                "Cannot attach cache twice to the same model."
+            module.cache_layers.append(layer)
+
+
+    def detach_from_model(self, model: Model | None = None):
+        """
+        Detach cache from model. Must be called if you want to delete a cache without deleting the model.
+        """
+        if model is None:
+            model = self.model
+        assert model.config.num_hidden_layers == self.num_layers, \
+            f"Cannot detach cache with {self.num_layers} layers from model with {model.config.num_hidden_layers} layers."
+        for layer, module in zip(self.layers, (m for m in model if m.caps.get("kv_cache"))):
+            module.cache_layers.remove(layer)
+
+
+    def get_layer(self, idx: int) -> tuple:
+        return self.layers[idx].get_kv()
+
+
+    def copy_page(
+        self,
+        target: Cache,
+        from_page: int,
+        to_page: int,
+        num_tokens: int,
+    ):
+        assert target.num_layers == self.num_layers
+        for src, dst in zip(target.layers, self.layers):
+            assert type(src) is type(dst)
+            dst.copy_page(src, from_page, to_page, num_tokens)
--- a/exllamav3/cache/fp16.py
+++ b/exllamav3/cache/fp16.py
@@ -0,0 +1,54 @@
+from __future__ import annotations
+from typing_extensions import override
+import torch
+import torch.nn.functional as F
+from torch import nn
+from ..constants import PAGE_SIZE
+from ..models import Model, Config
+from .cache import CacheLayer
+
+class CacheLayer_fp16(CacheLayer):
+
+    def __init__(
+        self,
+        config: Config,
+        max_num_tokens: int,
+    ):
+        super().__init__(config, max_num_tokens)
+
+        assert max_num_tokens % PAGE_SIZE == 0, \
+            f"max_num_tokens must be a multiple of {PAGE_SIZE}."
+
+        self.shape = (max_num_tokens // PAGE_SIZE, PAGE_SIZE, config.num_kv_heads, config.head_dim)
+        self.k = None
+        self.v = None
+        self.device = None
+
+
+    @override
+    def alloc(self, device: torch.device):
+        self.device = device
+        self.k = torch.zeros(self.shape, dtype = torch.half, device = device)
+        self.v = torch.zeros(self.shape, dtype = torch.half, device = device)
+
+
+    @override
+    def free(self):
+        self.device = None
+        self.k = None
+        self.v = None
+
+
+    @override
+    def get_kv(self):
+        return self.k, self.v
+
+
+    @override
+    def copy_page(self, source: CacheLayer_fp16, from_page: int, to_page: int, num_tokens: int):
+        kd = self.k[to_page, :num_tokens, :, :]
+        vd = self.v[to_page, :num_tokens, :, :]
+        ks = source.k[from_page, :num_tokens, :, :]
+        vs = source.v[from_page, :num_tokens, :, :]
+        kd.copy_(ks, non_blocking = True)
+        vd.copy_(vs, non_blocking = True)
--- a/exllamav3/constants.py
+++ b/exllamav3/constants.py
@@ -0,0 +1 @@
+PAGE_SIZE = 256
--- a/exllamav3/conversion/init.py
+++ b/exllamav3/conversion/init.py
--- a/exllamav3/conversion/allocation.py
+++ b/exllamav3/conversion/allocation.py
@@ -0,0 +1,97 @@
+from __future__ import annotations
+import math
+import bisect
+from functools import lru_cache
+from typing import TYPE_CHECKING
+if TYPE_CHECKING:
+    from ..modules.linear import Linear
+
+def allocate_transformer(
+    bpw: float,
+    surplus_bits: int,
+    q: Linear,
+    k: Linear,
+    v: Linear,
+    o: Linear,
+    g: Linear,
+    u: Linear,
+    d: Linear,
+) -> (dict, int):
+
+    # Submodules
+    keys = [
+        q.key,
+        k.key,
+        v.key,
+        o.key,
+        g.key,
+        u.key,
+        d.key,
+    ]
+    numels = [
+        q.weights_numel(),
+        k.weights_numel(),
+        v.weights_numel(),
+        o.weights_numel(),
+        g.weights_numel(),
+        u.weights_numel(),
+        d.weights_numel(),
+    ]
+    numel = sum(numels)
+
+    # Bits per weight from budget
+    budget = int(bpw * numel) + surplus_bits + 1
+    bpw = budget / numel
+
+    # Permutations to consider
+    @lru_cache
+    def get_perms(base):
+        perms_qkvo = [
+            [0, 0, 0, 0],
+            [0, 0, 1, 0],
+            [0, 0, 2, 0],
+            [0, 1, 1, 1],
+            [0, 1, 2, 1],
+            [1, 2, 2, 1],
+        ]
+        perms_gud = [
+            [0, 0, 0],
+            [0, 0, 1],
+            [0, 1, 1],
+            [1, 1, 1],
+        ]
+        p = [qkvo + gud for qkvo in perms_qkvo for gud in perms_gud]
+        p = [[min(8, p1 + base_bpw) for p1 in p2] for p2 in p]
+        return p
+
+    base_bpw = max(int(math.floor(bpw)), 1)
+    perms = get_perms(base_bpw)
+
+    # Find largest option within budget
+    options = [(sum(a * b for a, b in zip(p, numels)), p) for p in perms]
+    options.sort()
+    idx = bisect.bisect_right(options, (budget,))
+    idx = max(0, idx - 1)
+    used_budget, selected = options[idx]
+
+    # Output
+    strategy = {k: v for k, v in zip(keys, selected)}
+    surplus = budget - used_budget
+    return strategy, surplus
+
+
+def allocate_linear(
+    bpw: float,
+    surplus_bits: int,
+    l: Linear,
+) -> (dict, int):
+
+    numel = l.weights_numel()
+    budget = int(bpw * numel) + surplus_bits + 1
+    bpw = budget / numel
+    bpw = max(int(math.floor(bpw)), 1)
+    used_budget = bpw * numel
+
+    strategy = {l.key: bpw}
+    surplus = budget - used_budget
+    return strategy, surplus
--- a/exllamav3/conversion/calibration_data.py
+++ b/exllamav3/conversion/calibration_data.py
@@ -0,0 +1,92 @@
+import torch
+import os
+import random
+
+def split_art(articles, rows, columns, tokenizer):
+    t_rows = []
+    idx = 0
+    empty = torch.empty((1, 0), dtype = torch.long)
+    t_row = empty
+    while len(t_rows) < rows:
+        add_special_tokens = (len(t_rows) % 2 == 0)
+        t_art = tokenizer.encode(articles[idx], add_bos = add_special_tokens, add_eos = add_special_tokens)
+        t_row = torch.cat((t_row, t_art), dim = -1)
+        t_row = t_row[:, :columns]
+        if t_row.shape[-1] == columns:
+            t_rows.append(t_row)
+            t_row = empty
+        idx += 1
+    return t_rows
+
+
+def split_wiki(text, rows, columns, tokenizer):
+    articles = [a[a.find("\n") + 1:] for a in text.split("</doc>\n")]
+    articles = [a for a in articles if len(a) > 50]
+    return split_art(articles, rows, columns, tokenizer)
+
+
+def split_tiny(text, rows, columns, tokenizer):
+    articles = [a.strip() for a in text.split("<|endoftext|>")]
+    return split_art(articles, rows, columns, tokenizer)
+
+
+def shuffle_lines(text, rows, columns, tokenizer):
+    articles = text.split("\n")
+    articles = [a for a in articles if not a.isspace()]
+    random.seed(0)
+    random.shuffle(articles)
+    return split_art(articles, rows, columns, tokenizer)
+
+
+def split_raw(text, rows, columns, tokenizer):
+    t_all = tokenizer.encode(text)
+    t_rows = []
+    for i in range(rows):
+        a = i * columns
+        b = a + columns
+        t_rows.append(t_all[:, a:b])
+    return t_rows
+
+
+def random_data(text, rows, columns, tokenizer):
+    vocab_size = tokenizer.actual_vocab_size
+    torch.manual_seed(0)
+    t_rows = []
+    for i in range(rows):
+        t_row = torch.randint(0, vocab_size, (1, columns), dtype = torch.long)
+        t_rows.append(t_row)
+    return t_rows
+
+
+def get_default_calibration(args, tokenizer):
+    columns = args["cal_cols"]
+    rows = args["cal_rows"]
+
+    data_dir = os.path.join(os.path.dirname(os.path.abspath(__file__)), "standard_cal_data")
+    files = [
+        ("c4.utf8", 10, shuffle_lines),
+        ("code.utf8", 15, split_raw),
+        ("multilingual.utf8", 15, shuffle_lines),
+        ("technical.utf8", 10, split_raw),
+        ("wiki.utf8", 48, split_wiki),
+        ("tiny.utf8", 10, split_tiny),
+        (None, 20, random_data),
+    ]
+
+    dist_sum = sum(x for (_, x, _) in files)
+    cal_data = []
+
+    for filename, weight, processor in files:
+        target_rows = max(1, int(weight / dist_sum * rows))
+        if filename:
+            path = os.path.join(data_dir, filename)
+            with open(path, "r", encoding = "utf8") as f:
+                file_text = f.read()
+        else:
+            file_text = None
+            target_rows = max(1, rows - len(cal_data))
+        r = processor(file_text, target_rows, columns, tokenizer)
+        cal_data += r
+
+    # cal_data = torch.cat(cal_data, dim = 0)
+    return cal_data
--- a/exllamav3/conversion/compile.py
+++ b/exllamav3/conversion/compile.py
@@ -0,0 +1,137 @@
+import os
+import shutil
+import json
+from ..loader.safetensors import SafetensorsCollection
+from ..version import __version__
+from safetensors.torch import save_file
+from ..util.memory import free_mem
+
+def tsize(t):
+    return t.nelement() * t.element_size()
+
+def dsize(d):
+    size = 0
+    for _, v in d.items(): size += tsize(v)
+    return size
+
+def compile_model(args, model, config, tokenizer):
+
+    in_dir = args["in_dir"]
+    out_dir = args["out_dir"]
+    work_dir = args["work_dir"]
+    qtensors_dir = os.path.join(work_dir, "qtensors")
+    qtensors_stc = SafetensorsCollection(qtensors_dir)
+
+    # Prepare output directory
+    if not os.path.exists(out_dir):
+        print(f" -- Creating directory {out_dir}")
+        os.makedirs(out_dir)
+    else:
+        print(f" -- Writing into {out_dir}")
+        if len(os.listdir(out_dir)) != 0:
+            print(f" !! Warning, output directory is not empty")
+
+    # Allocate shards
+    total_size = 0
+    max_shard_bytes = args["shard_size"] * 1024**2
+    out_map = []
+    out_map.append([])
+    current_shard_size = 0
+    for module in model.modules:
+        prefix = module.key
+        sizes = qtensors_stc.get_tensor_sizes(prefix)
+        if len(sizes) == 0:
+            continue
+        size = sum(sizes)
+        if size > max_shard_bytes:
+            print(f" !! Warning, unable to fit module {module.key} in single shard of {args['shard_size']} MB")
+        if current_shard_size + size > max_shard_bytes and current_shard_size > 0:
+            current_shard_size = 0
+            out_map.append([])
+        current_shard_size += size
+        total_size += size
+        out_map[-1].append(module)
+
+    # Write model tensors
+    map_dict = {}
+    num_files = len(out_map)
+    for file_idx, modules in enumerate(out_map):
+        filename = (
+            "model.safetensors" if num_files == 1 else
+            f"model-{file_idx+1:05}-of-{num_files:05}.safetensors"
+        )
+        print(f" -- Writing {filename}")
+        file_dict = {}
+        for module in modules:
+            prefix = module.key
+            tensors = qtensors_stc.get_tensors(prefix, allow_bf16 = True)
+            tensors = {k: v.contiguous() for k, v in tensors.items()}
+            file_dict.update(tensors)
+        for name in file_dict.keys():
+            map_dict[name] = filename
+        save_file(file_dict, os.path.join(out_dir, filename))
+        del file_dict
+        free_mem()
+
+    # Copy non-tensor files
+    print(f" -- Copying non-tensor files from {in_dir}")
+    filtered_files = []
+    ignored_files = []
+    for f in os.listdir(in_dir):
+        if not os.path.isfile(os.path.join(in_dir, f)):
+            continue
+        if f.endswith(".safetensors"):
+            continue
+        if f == "config.json":
+            continue
+        if f == "model.safetensors.index.json":
+            continue
+        if any(f.endswith(x) for x in [".bin", ".ckpt", ".pth", ".pt"]):
+            ignored_files.append(f)
+            continue
+        filtered_files.append(f)
+    if ignored_files:
+        print(f" !! Warning, the following file(s) will not be included in output model:")
+        for f in ignored_files[:10]:
+            print(f"     - {f}")
+        if len(ignored_files) > 10:
+            print(f"     - (+ {len(ignored_files) - 10} more)")
+    for f in filtered_files:
+        print(f"     - {f}")
+        source_file_path = os.path.join(in_dir, f)
+        target_file_path = os.path.join(out_dir, f)
+        shutil.copy(source_file_path, target_file_path)
+
+    # Write new model.safetensors.index.json maybe
+    if num_files > 1:
+        print(f" -- Writing model.safetensors.index.json")
+        safetensors_index = {
+            "metadata": {
+                "total_size": total_size,
+            },
+            "weight_map": map_dict
+        }
+        with open(os.path.join(out_dir, "model.safetensors.index.json"), "w") as f:
+            f.write(json.dumps(safetensors_index, indent = 4))
+
+    # Update and write config.json
+    print(f" -- Writing config.json")
+    with open(os.path.join(in_dir, "config.json"), "r") as f:
+        config_dict = json.load(f)
+    qcfg = {
+        "quant_method": "exl3",
+        "version": __version__,
+        "bits": args["bits"],
+        # "head_bits": args["head_bits"],
+        "calibration": {
+            "rows": args["cal_rows"],
+            "cols": args["cal_cols"],
+        }
+    }
+    config_dict["quantization_config"] = qcfg
+    with open(os.path.join(out_dir, "config.json"), "w") as f:
+        f.write(json.dumps(config_dict, indent = 4))
+
+    print(f" -- Finished compiling model to {out_dir}")
+
+
--- a/exllamav3/conversion/convert_model.py
+++ b/exllamav3/conversion/convert_model.py
@@ -0,0 +1,356 @@
+import argparse
+import torch
+import time
+import sys
+from .. import Config, Model, Tokenizer
+from ..modules import Linear
+from ..modules.quant import LinearFP16
+from ..util.progress import ProgressBar
+from ..util.memory import free_mem
+from ..util import Timer
+from .calibration_data import get_default_calibration
+from .compile import compile_model, dsize
+from safetensors.torch import save_file
+from safetensors import safe_open
+import os, shutil
+import json
+
+torch.set_printoptions(precision = 5, sci_mode = False, linewidth = 200)
+
+parser = argparse.ArgumentParser()
+parser.add_argument("-i", "--in_dir", type = str, default = None, help = "Input (model) directory")
+parser.add_argument("-w", "--work_dir", type = str, default = None, help = "Working directory")
+parser.add_argument("-o", "--out_dir", type = str, default = None, help = "Output directory")
+parser.add_argument("-ss", "--shard_size", type = int, help = "Max shard size in MB, default: 8192")
+parser.add_argument("-b", "--bits", type = float, help = "Bits per weight")
+parser.add_argument("-hb", "--head_bits", type = int, default = None, help = "Bits per weight, output (head) layer, default: 6")
+parser.add_argument("-resume", "--resume", action = "store_true", help = "Resume interrupted job from working directory")
+parser.add_argument("-cr", "--cal_rows", type = int, help = "Calibration data size, rows, default: 100")
+parser.add_argument("-cc", "--cal_cols", type = int, help = "Calibration data size, columns, default: 2048")
+parser.add_argument("-cpi", "--checkpoint_interval", type = int, default = 60, help = "Minimum checkpoint interval, in seconds")
+parser.add_argument("-lcpi", "--last_checkpoint_index", type = int, default = None, help = "Last module index to checkpoint (for debug purposes)")
+parser.add_argument("-v", "--verbose", action = "store_true", help = "Verbose mode")
+
+num_ref_states = 5
+
+def save_dict(filename, dict_, args):
+    path = os.path.join(args["work_dir"], filename)
+    with open(path, "w", encoding = "utf8") as f:
+        f.write(json.dumps(dict_, indent = 4))
+
+
+def load_dict(filename, args):
+    path = os.path.join(args["work_dir"], filename)
+    with open(path, "r", encoding = "utf8") as f:
+        return json.load(f)
+
+
+def load_tensor(filename, args):
+    path = os.path.join(args["work_dir"], filename)
+    with safe_open(path, framework = "pt", device = "cpu") as f:
+        if "tensor" in f.keys():
+            return f.get_tensor("tensor")
+        else:
+            tensors = []
+            i = 0
+            while f"tensor.{i}" in f.keys():
+                tensors.append(f.get_tensor(f"tensor.{i}"))
+                i += 1
+            return tensors
+
+
+def save_tensor(tensor, filename: str, args):
+    path = os.path.join(args["work_dir"], filename)
+    if isinstance(tensor, dict):
+        save_file({
+            k: v for k, v in tensor.items()
+        }, path)
+    elif isinstance(tensor, list):
+        save_file({
+            f"tensor.{i}": t for i, t in enumerate(tensor)
+        }, path)
+    else:
+        save_file({
+            f"tensor": tensor
+        }, path)
+
+
+def prepare_env(args):
+    qtensors_dir = os.path.join(args["work_dir"], "qtensors")
+    ckpt_dir = os.path.join(args["work_dir"], "ckpt")
+    os.makedirs(args["work_dir"], exist_ok = True)
+    os.makedirs(qtensors_dir, exist_ok = True)
+    os.makedirs(ckpt_dir, exist_ok = True)
+
+
+def prepare(args) -> (dict, bool, str, str):
+    if not args.work_dir:
+        return None, None, False, "Must specify --work_dir"
+    if not args.in_dir and not args.resume:
+        return None, None, False, "Specify either --in_dir to start a new job or --resume to resume an interrupted job"
+    if not args.out_dir and not args.resume:
+        return None, None, False, "Must specify --out_dir or --resume"
+
+    in_args = { "work_dir": args.work_dir }
+    if args.resume:
+        in_args = load_dict("args.json", in_args)
+        in_args["work_dir"] = args.work_dir
+
+    prepare_env(in_args)
+
+    def override(arg, can_override, default):
+        if (arg not in args or vars(args)[arg] is None) and arg not in in_args:
+            if default is not None:
+                in_args[arg] = default
+            else:
+                raise ValueError(f" ## Missing required argument: {arg}")
+        if arg in args and vars(args)[arg] is not None:
+            if arg in in_args and vars(args)[arg] and in_args[arg] != vars(args)[arg]:
+                if can_override:
+                    print(
+                        f" !! Warning: Overriding {arg} from existing job, was: {in_args[arg]}, "
+                        f"new value: {vars(args)[arg]}"
+                    )
+                else:
+                    raise ValueError(
+                        f" ## Error: Resuming job with {arg} = {in_args[arg]}, "
+                        f"cannot override with new value of {vars(args)[arg]}. "
+                        f"Please start a new job to change this value."
+                    )
+            in_args[arg] = vars(args)[arg]
+
+    for arg_, can_override, default in [
+        ("in_dir", True, None),
+        ("out_dir", True, None),
+        ("shard_size", True, 8192),
+        ("bits", False, None),
+        ("head_bits", False, 6),
+        ("cal_rows", False, 100),
+        ("cal_cols", False, 2048),
+        ("checkpoint_interval", True, None),
+        ("last_checkpoint_index", True, -1),
+    ]:
+        override(arg_, can_override, default)
+
+    # Momentary args
+    in_args["verbose"] = args.verbose
+
+    if args.resume:
+        job_state = load_dict("ckpt/job.json", in_args)
+        print(f" -- Resuming existing job")
+    else:
+        print(f" -- Creating new job")
+        job_state = {
+            "next_module_idx": 0,
+            "surplus_bits": 0,
+        }
+        save_dict("args.json", in_args, in_args)
+        save_dict("ckpt/job.json", job_state, in_args)
+
+    print(f"    Input directory: {in_args['in_dir']}")
+    print(f"    Working directory: {in_args['work_dir']}")
+    print(f"    Output directory: {in_args['out_dir']}")
+    print(f"    Calibration size: {in_args['cal_rows']} rows, {in_args['cal_cols']} columns")
+    print(f"    Target bitrate: {in_args['bits']} (decoder), {in_args['head_bits']} (head)")
+
+    return in_args, job_state, True, None
+
+
+def get_base_model(args):
+    config = Config.from_directory(args["in_dir"])
+    print(f" -- Loaded model config")
+    print(f"    Architecture: {config.architecture}")
+    model = Model.from_config(config)
+    print(f" -- Created model instance:")
+    print(model.get_layout_tree(4))
+    tokenizer = Tokenizer.from_config(config)
+    print(f" -- Loaded tokenizer")
+    print(f"    Vocab size: {tokenizer.actual_vocab_size}")
+    return config, model, tokenizer
+
+
+def prepare_state(args, job_state, config, model, tokenizer):
+    idx = job_state["next_module_idx"]
+    if idx == 0:
+        print(f" -- Preparing input state")
+        state = get_default_calibration(args, tokenizer)
+    else:
+        if idx < len(model.modules):
+            print(f" -- Resuming at: {model.modules[idx].key}")
+        else:
+            print(f" -- Resuming after: {model.modules[idx - 1].key}")
+        state = load_tensor("ckpt/state.safetensors", args)
+    return state
+
+
+def get_state_error(x, ref):
+     x = x.view(-1, x.shape[-1]).float()
+     ref = ref.view(-1, ref.shape[-1]).float()
+     err = torch.linalg.norm(x - ref, 'fro') / torch.linalg.norm(ref, 'fro')
+     return err.item()
+
+
+@torch.inference_mode()
+def main(args, job_state):
+
+    torch.set_printoptions(precision = 5, sci_mode = False, linewidth = 200)
+
+    torch.set_grad_enabled(False)
+    device = torch.device("cuda:0")
+    last_checkpoint_time = time.time()
+
+    # Get model
+    config, model, tokenizer = get_base_model(args)
+
+    # Get initial state or resume state
+    state = prepare_state(args, job_state, config, model, tokenizer)
+
+    # Iterate over modules
+    for idx, module in enumerate(model.modules):
+
+        # If resuming, skip along to checkpoint index
+        if idx < job_state["next_module_idx"]:
+            continue
+
+        # Load current module
+        print(f" -- Loading unquantized module: {module.key}")
+        module.load(torch.device("cpu") if module.caps.get("prefer_cpu") else device)
+        for m in module:
+            if m.used_alt_key:
+                print(f"     - Cloned {m.key} from {m.alt_key}")
+
+        # Skip modules without quant targets
+        qmaps = module.get_qmaps()
+        if len(qmaps) > 0:
+
+            # Capture calibration input states during forward pass
+            with ProgressBar(f" -- Capturing: {module.key}", len(state)) as progress:
+                capture_H = {}
+                ref_states = []
+                for i in range(len(state)):
+                    progress.update(i)
+                    params = {
+                        "attn_mode": "flash_attn_nc",
+                        "capture": capture_H
+                    }
+                    rs = module.prepare_for_device(state[i], params)
+                    rs = module.forward(rs, params)
+                    if i < num_ref_states:
+                        ref_states.append(rs.cpu())
+                    rs = None
+            print(f" -- Captured: {module.key}")
+            sys.stdout.flush()
+
+            # Swap captured H to system RAM
+            for k, v in capture_H.items():
+                v["H_swap_device"] = v["H"].device
+                v["H"] = v["H"].cpu()
+
+        # Get submodules to quantize
+        linears = [m for m in module if isinstance(m, Linear) and m.qmap]
+
+        # Move original tensors to system RAM (load to GPU one by one when quantizing)
+        for linear in linears:
+            assert isinstance(linear.inner, LinearFP16)
+            linear.inner.swap_cpu()
+
+        # Quantization strategy
+        if linears:
+            strategy, surplus = module.allocate_q(
+                {
+                    "bits": args["bits"],
+                    "head_bits": args["head_bits"],
+                },
+                job_state["surplus_bits"],
+            )
+            job_state["surplus_bits"] = surplus
+            assert all(m.key in strategy for m in linears), \
+                f" ## Logic error, no quantization strategy for {m.key}"
+
+        # Quantize module
+        for linear in linears:
+            linear.inner.unswap_cpu()
+            quant_args = {
+                "seed": idx,
+                "K": strategy[linear.key],
+            }
+            with Timer() as t:
+                proxy_err = linear.convert_exl3(
+                    capture_H[linear.qmap],
+                    quant_args = quant_args,
+                    progress_str = f" -- <step>: {linear.key}",
+                    verbose = args["verbose"]
+                )
+            print(
+                f" -- Quantized: {linear.key:{config.stc.max_key_len()}}"
+                f"  bpw: {quant_args['K']:.2f}"
+                f"  proxy_err: {proxy_err:8.6f}"
+                f"  [{t.interval:4.2f} s]"
+            )
+            sys.stdout.flush()
+
+        # Save converted module tensors
+        tensors = {}
+        for m in module:
+            tensors.update(m.get_tensors())
+        qtensors_dir = os.path.join(args["work_dir"], "qtensors")
+        out_file = os.path.join(qtensors_dir, f"{module.key}.safetensors")
+        save_tensor(tensors, out_file, args)
+
+        # Output final bpw for layer
+        num_bytes = dsize(tensors)
+        num_bits = num_bytes * 8
+        final_bpw = num_bits / module.weights_numel()
+        print(
+            f" -- Quantized: {module.key:{config.stc.max_key_len()}}"
+            f"  bpw: {final_bpw:.2f}"
+        )
+
+        del tensors
+        free_mem()
+
+        # Advance state
+        error = 0
+        with ProgressBar(f" -- Forward pass: {module.key}", len(state)) as progress:
+            for i in range(len(state)):
+                progress.update(i)
+                params = {
+                    "attn_mode": "flash_attn_nc",
+                }
+                state[i] = module.prepare_for_device(state[i], params)
+                if i < num_ref_states or idx < len(model.modules) - 1:
+                    state[i] = module.forward(state[i], params).cpu()
+                if i < num_ref_states and len(linears):
+                    ref_states[i] = ref_states[i].to(state[i].device)
+                    error += get_state_error(state[i], ref_states[i])
+                    ref_states[i] = None
+        error /= num_ref_states
+        print(f" -- Finished module: {module.key}, rfn: {error:.6f}")
+        sys.stdout.flush()
+
+        # Unload current module
+        module.unload()
+        free_mem()
+
+        # Checkpoint
+        job_state["next_module_idx"] = idx + 1
+        if time.time() > last_checkpoint_time + args["checkpoint_interval"] and \
+            (args.get("last_checkpoint_index", -1) < 0 or idx <= args["last_checkpoint_index"]):
+            print(f" -- Saving checkpoint")
+            ckpt_dir = os.path.join(args["work_dir"], "ckpt")
+            ckpt_dir_old = os.path.join(args["work_dir"], "ckpt_old")
+            ckpt_dir_new = os.path.join(args["work_dir"], "ckpt_new")
+            os.makedirs(ckpt_dir_new, exist_ok = True)
+            save_dict("ckpt_new/job.json", job_state, args)
+            save_tensor(state, "ckpt_new/state.safetensors", args)
+            if os.path.exists(ckpt_dir_old):
+                shutil.rmtree(ckpt_dir_old)
+            os.rename(ckpt_dir, ckpt_dir_old)
+            os.rename(ckpt_dir_new, ckpt_dir)
+            last_checkpoint_time = time.time()
+
+    # Compile model
+    compile_model(args, model, config, tokenizer)
+
+    # All done
+    print(" -- All done")
--- a/exllamav3/conversion/standard_cal_data/c4.utf8
+++ b/exllamav3/conversion/standard_cal_data/c4.utf8
--- a/exllamav3/conversion/standard_cal_data/code.utf8
+++ b/exllamav3/conversion/standard_cal_data/code.utf8
--- a/exllamav3/conversion/standard_cal_data/multilingual.utf8
+++ b/exllamav3/conversion/standard_cal_data/multilingual.utf8
--- a/exllamav3/conversion/standard_cal_data/technical.utf8
+++ b/exllamav3/conversion/standard_cal_data/technical.utf8
@@ -0,0 +1,988 @@
+MIT License
+
+Permission is hereby granted, free of charge, to any person obtaining a copy
+of this software and associated documentation files (the "Software"), to deal
+in the Software without restriction, including without limitation the rights
+to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
+copies of the Software, and to permit persons to whom the Software is
+furnished to do so, subject to the following conditions:
+
+The above copyright notice and this permission notice shall be included in all
+copies or substantial portions of the Software.
+
+THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
+IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
+FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
+AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
+LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
+OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
+SOFTWARE.
+
+A Turing machine is a mathematical model of computation describing an abstract machine[1] that manipulates symbols on a strip of tape according to a table of rules.[2] Despite the model's simplicity, it is capable of implementing any computer algorithm.[3]
+
+The machine operates on an infinite[4] memory tape divided into discrete cells,[5] each of which can hold a single symbol drawn from a finite set of symbols called the alphabet of the machine. It has a "head" that, at any point in the machine's operation, is positioned over one of these cells, and a "state" selected from a finite set of states. At each step of its operation, the head reads the symbol in its cell. Then, based on the symbol and the machine's own present state, the machine writes a symbol into the same cell, and moves the head one step to the left or the right,[6] or halts the computation. The choice of which replacement symbol to write, which direction to move the head, and whether to halt is based on a finite table that specifies what to do for each combination of the current state and the symbol that is read. Like a real computer program, it is possible for a Turing machine to go into an infinite loop which will never halt.
+
+The Turing machine was invented in 1936 by Alan Turing,[7][8] who called it an "a-machine" (automatic machine).[9] It was Turing's doctoral advisor, Alonzo Church, who later coined the term "Turing machine" in a review.[10] With this model, Turing was able to answer two questions in the negative:
+
+    Does a machine exist that can determine whether any arbitrary machine on its tape is "circular" (e.g., freezes, or fails to continue its computational task)?
+    Does a machine exist that can determine whether any arbitrary machine on its tape ever prints a given symbol?[11][12]
+
+Thus by providing a mathematical description of a very simple device capable of arbitrary computations, he was able to prove properties of computation in general—and in particular, the uncomputability of the Entscheidungsproblem ('decision problem').[13]
+
+Turing machines proved the existence of fundamental limitations on the power of mechanical computation.[14] While they can express arbitrary computations, their minimalist design makes them too slow for computation in practice: real-world computers are based on different designs that, unlike Turing machines, use random-access memory.
+
+Turing completeness is the ability for a computational model or a system of instructions to simulate a Turing machine. A programming language that is Turing complete is theoretically capable of expressing all tasks accomplishable by computers; nearly all programming languages are Turing complete if the limitations of finite memory are ignored.
+Overview
+
+A Turing machine is an idealised model of a central processing unit (CPU) that controls all data manipulation done by a computer, with the canonical machine using sequential memory to store data. Typically, the sequential memory is represented as a tape of infinite length on which the machine can perform read and write operations.
+
+In the context of formal language theory, a Turing machine (automaton) is capable of enumerating some arbitrary subset of valid strings of an alphabet. A set of strings which can be enumerated in this manner is called a recursively enumerable language. The Turing machine can equivalently be defined as a model that recognises valid input strings, rather than enumerating output strings.
+
+Given a Turing machine M and an arbitrary string s, it is generally not possible to decide whether M will eventually produce s. This is due to the fact that the halting problem is unsolvable, which has major implications for the theoretical limits of computing.
+
+The Turing machine is capable of processing an unrestricted grammar, which further implies that it is capable of robustly evaluating first-order logic in an infinite number of ways. This is famously demonstrated through lambda calculus.
+
+A Turing machine that is able to simulate any other Turing machine is called a universal Turing machine (UTM, or simply a universal machine). Another mathematical formalism, lambda calculus, with a similar "universal" nature was introduced by Alonzo Church. Church's work intertwined with Turing's to form the basis for the Church–Turing thesis. This thesis states that Turing machines, lambda calculus, and other similar formalisms of computation do indeed capture the informal notion of effective methods in logic and mathematics and thus provide a model through which one can reason about an algorithm or "mechanical procedure" in a mathematically precise way without being tied to any particular formalism. Studying the abstract properties of Turing machines has yielded many insights into computer science, computability theory, and complexity theory.
+Physical description
+
+In his 1948 essay, "Intelligent Machinery", Turing wrote that his machine consisted of:
+
+    ...an unlimited memory capacity obtained in the form of an infinite tape marked out into squares, on each of which a symbol could be printed. At any moment there is one symbol in the machine; it is called the scanned symbol. The machine can alter the scanned symbol, and its behavior is in part determined by that symbol, but the symbols on the tape elsewhere do not affect the behavior of the machine. However, the tape can be moved back and forth through the machine, this being one of the elementary operations of the machine. Any symbol on the tape may therefore eventually have an innings.[15]
+    — Turing 1948, p. 3[16]
+
+Description
+For visualizations of Turing machines, see Turing machine gallery.
+
+The Turing machine mathematically models a machine that mechanically operates on a tape. On this tape are symbols, which the machine can read and write, one at a time, using a tape head. Operation is fully determined by a finite set of elementary instructions such as "in state 42, if the symbol seen is 0, write a 1; if the symbol seen is 1, change into state 17; in state 17, if the symbol seen is 0, write a 1 and change to state 6;" etc. In the original article ("On Computable Numbers, with an Application to the Entscheidungsproblem", see also references below), Turing imagines not a mechanism, but a person whom he calls the "computer", who executes these deterministic mechanical rules slavishly (or as Turing puts it, "in a desultory manner").
+The head is always over a particular square of the tape; only a finite stretch of squares is shown. The instruction to be performed (q4) is shown over the scanned square. (Drawing after Kleene (1952) p. 375.)
+Here, the internal state (q1) is shown inside the head, and the illustration describes the tape as being infinite and pre-filled with "0", the symbol serving as blank. The system's full state (its "complete configuration") consists of the internal state, any non-blank symbols on the tape (in this illustration "11B"), and the position of the head relative to those symbols including blanks, i.e. "011B". (Drawing after Minsky (1967) p. 121.)
+
+More explicitly, a Turing machine consists of:
+
+    A tape divided into cells, one next to the other. Each cell contains a symbol from some finite alphabet. The alphabet contains a special blank symbol (here written as '0') and one or more other symbols. The tape is assumed to be arbitrarily extendable to the left and to the right, so that the Turing machine is always supplied with as much tape as it needs for its computation. Cells that have not been written before are assumed to be filled with the blank symbol. In some models the tape has a left end marked with a special symbol; the tape extends or is indefinitely extensible to the right.
+    A head that can read and write symbols on the tape and move the tape left and right one (and only one) cell at a time. In some models the head moves and the tape is stationary.
+    A state register that stores the state of the Turing machine, one of finitely many. Among these is the special start state with which the state register is initialised. These states, writes Turing, replace the "state of mind" a person performing computations would ordinarily be in.
+    A finite table[17] of instructions[18] that, given the state(qi) the machine is currently in and the symbol(aj) it is reading on the tape (symbol currently under the head), tells the machine to do the following in sequence (for the 5-tuple models):
+
+    Either erase or write a symbol (replacing aj with aj1).
+    Move the head (which is described by dk and can have values: 'L' for one step left or 'R' for one step right or 'N' for staying in the same place).
+    Assume the same or a new state as prescribed (go to state qi1).
+
+In the 4-tuple models, erasing or writing a symbol (aj1) and moving the head left or right (dk) are specified as separate instructions. The table tells the machine to (ia) erase or write a symbol or (ib) move the head left or right, and then (ii) assume the same or a new state as prescribed, but not both actions (ia) and (ib) in the same instruction. In some models, if there is no entry in the table for the current combination of symbol and state, then the machine will halt; other models require all entries to be filled.
+
+Every part of the machine (i.e. its state, symbol-collections, and used tape at any given time) and its actions (such as printing, erasing and tape motion) is finite, discrete and distinguishable; it is the unlimited amount of tape and runtime that gives it an unbounded amount of storage space.
+Formal definition
+
+Following Hopcroft & Ullman (1979, p. 148), a (one-tape) Turing machine can be formally defined as a 7-tuple M = ⟨ Q , Γ , b , Σ , δ , q 0 , F ⟩ M=\langle Q,\Gamma ,b,\Sigma ,\delta ,q_{0},F\rangle where
+
+    Γ \Gamma is a finite, non-empty set of tape alphabet symbols;
+    b ∈ Γ b\in \Gamma is the blank symbol (the only symbol allowed to occur on the tape infinitely often at any step during the computation);
+    Σ ⊆ Γ ∖ { b } \Sigma \subseteq \Gamma \setminus \{b\} is the set of input symbols, that is, the set of symbols allowed to appear in the initial tape contents;
+    Q Q is a finite, non-empty set of states;
+    q 0 ∈ Q q_{0}\in Q is the initial state;
+    F ⊆ Q F\subseteq Q is the set of final states or accepting states. The initial tape contents is said to be accepted by M M if it eventually halts in a state from F F.
+    δ : ( Q ∖ F ) × Γ ↛ Q × Γ × { L , R } {\displaystyle \delta :(Q\setminus F)\times \Gamma \not \to Q\times \Gamma \times \{L,R\}} is a partial function called the transition function, where L is left shift, R is right shift. If δ \delta is not defined on the current state and the current tape symbol, then the machine halts;[19] intuitively, the transition function specifies the next state transited from the current state, which symbol to overwrite the current symbol pointed by the head, and the next head movement.
+
+3-state Busy Beaver. Black icons represent location and state of head; square colors represent 1s (orange) and 0s (white); time progresses vertically from the top until the HALT state at the bottom.
+
+A relatively uncommon variant allows "no shift", say N, as a third element of the set of directions { L , R } \{L,R\}.
+
+The 7-tuple for the 3-state busy beaver looks like this (see more about this busy beaver at Turing machine examples):
+
+    Q = { A , B , C , HALT } Q=\{{\mbox{A}},{\mbox{B}},{\mbox{C}},{\mbox{HALT}}\} (states);
+    Γ = { 0 , 1 } \Gamma =\{0,1\} (tape alphabet symbols);
+    b = 0 b=0 (blank symbol);
+    Σ = { 1 } \Sigma =\{1\} (input symbols);
+    q 0 = A q_{0}={\mbox{A}} (initial state);
+    F = { HALT } F=\{{\mbox{HALT}}\} (final states);
+    δ = \delta = see state-table below (transition function).
+
+Initially all tape cells are marked with 0 {\displaystyle 0}.
+State table for 3-state, 2-symbol busy beaver Tape symbol 	Current state A 	Current state B 	Current state C
+Write symbol 	Move tape 	Next state 	Write symbol 	Move tape 	Next state 	Write symbol 	Move tape 	Next state
+0 	1 	R 	B 	1 	L 	A 	1 	L 	B
+1 	1 	L 	C 	1 	R 	B 	1 	R 	HALT
+Additional details required to visualise or implement Turing machines
+
+In the words of van Emde Boas (1990), p. 6: "The set-theoretical object [his formal seven-tuple description similar to the above] provides only partial information on how the machine will behave and what its computations will look like."
+
+For instance,
+
+    There will need to be many decisions on what the symbols actually look like, and a failproof way of reading and writing symbols indefinitely.
+    The shift left and shift right operations may shift the tape head across the tape, but when actually building a Turing machine it is more practical to make the tape slide back and forth under the head instead.
+    The tape can be finite, and automatically extended with blanks as needed (which is closest to the mathematical definition), but it is more common to think of it as stretching infinitely at one or both ends and being pre-filled with blanks except on the explicitly given finite fragment the tape head is on. (This is, of course, not implementable in practice.) The tape cannot be fixed in length, since that would not correspond to the given definition and would seriously limit the range of computations the machine can perform to those of a linear bounded automaton if the tape was proportional to the input size, or finite-state machine if it was strictly fixed-length.
+
+TERMS AND CONDITIONS FOR USE, REPRODUCTION, AND DISTRIBUTION
+
+1. Definitions.
+
+"License" shall mean the terms and conditions for use, reproduction, and distribution as defined by Sections 1 through 9 of this document.
+
+"Licensor" shall mean the copyright owner or entity authorized by the copyright owner that is granting the License.
+
+"Legal Entity" shall mean the union of the acting entity and all other entities that control, are controlled by, or are under common control with that entity. For the purposes of this definition, "control" means (i) the power, direct or indirect, to cause the direction or management of such entity, whether by contract or otherwise, or (ii) ownership of fifty percent (50%) or more of the outstanding shares, or (iii) beneficial ownership of such entity.
+
+"You" (or "Your") shall mean an individual or Legal Entity exercising permissions granted by this License.
+
+"Source" form shall mean the preferred form for making modifications, including but not limited to software source code, documentation source, and configuration files.
+
+"Object" form shall mean any form resulting from mechanical transformation or translation of a Source form, including but not limited to compiled object code, generated documentation, and conversions to other media types.
+
+"Work" shall mean the work of authorship, whether in Source or Object form, made available under the License, as indicated by a copyright notice that is included in or attached to the work (an example is provided in the Appendix below).
+
+"Derivative Works" shall mean any work, whether in Source or Object form, that is based on (or derived from) the Work and for which the editorial revisions, annotations, elaborations, or other modifications represent, as a whole, an original work of authorship. For the purposes of this License, Derivative Works shall not include works that remain separable from, or merely link (or bind by name) to the interfaces of, the Work and Derivative Works thereof.
+
+"Contribution" shall mean any work of authorship, including the original version of the Work and any modifications or additions to that Work or Derivative Works thereof, that is intentionally submitted to Licensor for inclusion in the Work by the copyright owner or by an individual or Legal Entity authorized to submit on behalf of the copyright owner. For the purposes of this definition, "submitted" means any form of electronic, verbal, or written communication sent to the Licensor or its representatives, including but not limited to communication on electronic mailing lists, source code control systems, and issue tracking systems that are managed by, or on behalf of, the Licensor for the purpose of discussing and improving the Work, but excluding communication that is conspicuously marked or otherwise designated in writing by the copyright owner as "Not a Contribution."
+
+"Contributor" shall mean Licensor and any individual or Legal Entity on behalf of whom a Contribution has been received by Licensor and subsequently incorporated within the Work.
+
+2. Grant of Copyright License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable copyright license to reproduce, prepare Derivative Works of, publicly display, publicly perform, sublicense, and distribute the Work and such Derivative Works in Source or Object form.
+
+3. Grant of Patent License. Subject to the terms and conditions of this License, each Contributor hereby grants to You a perpetual, worldwide, non-exclusive, no-charge, royalty-free, irrevocable (except as stated in this section) patent license to make, have made, use, offer to sell, sell, import, and otherwise transfer the Work, where such license applies only to those patent claims licensable by such Contributor that are necessarily infringed by their Contribution(s) alone or by combination of their Contribution(s) with the Work to which such Contribution(s) was submitted. If You institute patent litigation against any entity (including a cross-claim or counterclaim in a lawsuit) alleging that the Work or a Contribution incorporated within the Work constitutes direct or contributory patent infringement, then any patent licenses granted to You under this License for that Work shall terminate as of the date such litigation is filed.
+
+4. Redistribution. You may reproduce and distribute copies of the Work or Derivative Works thereof in any medium, with or without modifications, and in Source or Object form, provided that You meet the following conditions:
+
+    You must give any other recipients of the Work or Derivative Works a copy of this License; and
+    You must cause any modified files to carry prominent notices stating that You changed the files; and
+    You must retain, in the Source form of any Derivative Works that You distribute, all copyright, patent, trademark, and attribution notices from the Source form of the Work, excluding those notices that do not pertain to any part of the Derivative Works; and
+    If the Work includes a "NOTICE" text file as part of its distribution, then any Derivative Works that You distribute must include a readable copy of the attribution notices contained within such NOTICE file, excluding those notices that do not pertain to any part of the Derivative Works, in at least one of the following places: within a NOTICE text file distributed as part of the Derivative Works; within the Source form or documentation, if provided along with the Derivative Works; or, within a display generated by the Derivative Works, if and wherever such third-party notices normally appear. The contents of the NOTICE file are for informational purposes only and do not modify the License. You may add Your own attribution notices within Derivative Works that You distribute, alongside or as an addendum to the NOTICE text from the Work, provided that such additional attribution notices cannot be construed as modifying the License.
+
+You may add Your own copyright statement to Your modifications and may provide additional or different license terms and conditions for use, reproduction, or distribution of Your modifications, or for any such Derivative Works as a whole, provided Your use, reproduction, and distribution of the Work otherwise complies with the conditions stated in this License.
+
+5. Submission of Contributions. Unless You explicitly state otherwise, any Contribution intentionally submitted for inclusion in the Work by You to the Licensor shall be under the terms and conditions of this License, without any additional terms or conditions. Notwithstanding the above, nothing herein shall supersede or modify the terms of any separate license agreement you may have executed with Licensor regarding such Contributions.
+
+6. Trademarks. This License does not grant permission to use the trade names, trademarks, service marks, or product names of the Licensor, except as required for reasonable and customary use in describing the origin of the Work and reproducing the content of the NOTICE file.
+
+7. Disclaimer of Warranty. Unless required by applicable law or agreed to in writing, Licensor provides the Work (and each Contributor provides its Contributions) on an "AS IS" BASIS, WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied, including, without limitation, any warranties or conditions of TITLE, NON-INFRINGEMENT, MERCHANTABILITY, or FITNESS FOR A PARTICULAR PURPOSE. You are solely responsible for determining the appropriateness of using or redistributing the Work and assume any risks associated with Your exercise of permissions under this License.
+
+8. Limitation of Liability. In no event and under no legal theory, whether in tort (including negligence), contract, or otherwise, unless required by applicable law (such as deliberate and grossly negligent acts) or agreed to in writing, shall any Contributor be liable to You for damages, including any direct, indirect, special, incidental, or consequential damages of any character arising as a result of this License or out of the use or inability to use the Work (including but not limited to damages for loss of goodwill, work stoppage, computer failure or malfunction, or any and all other commercial damages or losses), even if such Contributor has been advised of the possibility of such damages.
+
+9. Accepting Warranty or Additional Liability. While redistributing the Work or Derivative Works thereof, You may choose to offer, and charge a fee for, acceptance of support, warranty, indemnity, or other liability obligations and/or rights consistent with this License. However, in accepting such obligations, You may act only on Your own behalf and on Your sole responsibility, not on behalf of any other Contributor, and only if You agree to indemnify, defend, and hold each Contributor harmless for any liability incurred by, or claims asserted against, such Contributor by reason of your accepting any such warranty or additional liability.
+
+END OF TERMS AND CONDITIONS
+
+
+How to apply the Apache License to your work
+
+Include a copy of the Apache License, typically in a file called LICENSE, in your work, and consider also including a NOTICE file that references the License.
+
+To apply the Apache License to specific files in your work, attach the following boilerplate declaration, replacing the fields enclosed by brackets "[]" with your own identifying information. (Don't include the brackets!) Enclose the text in the appropriate comment syntax for the file format. We also recommend that you include a file or class name and description of purpose on the same "printed page" as the copyright notice for easier identification within third-party archives.
+
+Copyright [yyyy] [name of copyright owner]
+
+Licensed under the Apache License, Version 2.0 (the "License");
+you may not use this file except in compliance with the License.
+You may obtain a copy of the License at
+
+    http://www.apache.org/licenses/LICENSE-2.0
+
+Unless required by applicable law or agreed to in writing, software
+distributed under the License is distributed on an "AS IS" BASIS,
+WITHOUT WARRANTIES OR CONDITIONS OF ANY KIND, either express or implied.
+See the License for the specific language governing permissions and
+limitations under the License.
+
+
+Human TP53 gene
+
+In humans, a common polymorphism involves the substitution of an arginine for a proline at codon position 72 of exon 4. Many studies have investigated a genetic link between this variation and cancer susceptibility; however, the results have been controversial. For instance, a meta-analysis from 2009 failed to show a link for cervical cancer.[15] A 2011 study found that the TP53 proline mutation did have a profound effect on pancreatic cancer risk among males.[16] A study of Arab women found that proline homozygosity at TP53 codon 72 is associated with a decreased risk for breast cancer.[17] One study suggested that TP53 codon 72 polymorphisms, MDM2 SNP309, and A2164G may collectively be associated with non-oropharyngeal cancer susceptibility and that MDM2 SNP309 in combination with TP53 codon 72 may accelerate the development of non-oropharyngeal cancer in women.[18] A 2011 study found that TP53 codon 72 polymorphism was associated with an increased risk of lung cancer.[19]
+
+Meta-analyses from 2011 found no significant associations between TP53 codon 72 polymorphisms and both colorectal cancer risk[20] and endometrial cancer risk.[21] A 2011 study of a Brazilian birth cohort found an association between the non-mutant arginine TP53 and individuals without a family history of cancer.[22] Another 2011 study found that the p53 homozygous (Pro/Pro) genotype was associated with a significantly increased risk for renal cell carcinoma.[23]
+Function
+DNA damage and repair
+
+p53 plays a role in regulation or progression through the cell cycle, apoptosis, and genomic stability by means of several mechanisms:
+
+    It can activate DNA repair proteins when DNA has sustained damage. Thus, it may be an important factor in aging.[24]
+    It can arrest growth by holding the cell cycle at the G1/S regulation point on DNA damage recognition—if it holds the cell here for long enough, the DNA repair proteins will have time to fix the damage and the cell will be allowed to continue the cell cycle.
+    It can initiate apoptosis (i.e., programmed cell death) if DNA damage proves to be irreparable.
+    It is essential for the senescence response to short telomeres.
+
+p53 pathway: In a normal cell, p53 is inactivated by its negative regulator, mdm2. Upon DNA damage or other stresses, various pathways will lead to the dissociation of the p53 and mdm2 complex. Once activated, p53 will induce a cell cycle arrest to allow either repair and survival of the cell or apoptosis to discard the damaged cell. How p53 makes this choice is currently unknown.
+
+WAF1/CIP1 encoding for p21 and hundreds of other down-stream genes. p21 (WAF1) binds to the G1-S/CDK (CDK4/CDK6, CDK2, and CDK1) complexes (molecules important for the G1/S transition in the cell cycle) inhibiting their activity.
+
+When p21(WAF1) is complexed with CDK2, the cell cannot continue to the next stage of cell division. A mutant p53 will no longer bind DNA in an effective way, and, as a consequence, the p21 protein will not be available to act as the "stop signal" for cell division.[25] Studies of human embryonic stem cells (hESCs) commonly describe the nonfunctional p53-p21 axis of the G1/S checkpoint pathway with subsequent relevance for cell cycle regulation and the DNA damage response (DDR). Importantly, p21 mRNA is clearly present and upregulated after the DDR in hESCs, but p21 protein is not detectable. In this cell type, p53 activates numerous microRNAs (like miR-302a, miR-302b, miR-302c, and miR-302d) that directly inhibit the p21 expression in hESCs.
+
+The p21 protein binds directly to cyclin-CDK complexes that drive forward the cell cycle and inhibits their kinase activity, thereby causing cell cycle arrest to allow repair to take place. p21 can also mediate growth arrest associated with differentiation and a more permanent growth arrest associated with cellular senescence. The p21 gene contains several p53 response elements that mediate direct binding of the p53 protein, resulting in transcriptional activation of the gene encoding the p21 protein.
+
+The p53 and RB1 pathways are linked via p14ARF, raising the possibility that the pathways may regulate each other.[26]
+
+p53 expression can be stimulated by UV light, which also causes DNA damage. In this case, p53 can initiate events leading to tanning.[27][28]
+Stem cells
+
+Levels of p53 play an important role in the maintenance of stem cells throughout development and the rest of human life.
+
+In human embryonic stem cells (hESCs)s, p53 is maintained at low inactive levels.[29] This is because activation of p53 leads to rapid differentiation of hESCs.[30] Studies have shown that knocking out p53 delays differentiation and that adding p53 causes spontaneous differentiation, showing how p53 promotes differentiation of hESCs and plays a key role in cell cycle as a differentiation regulator. When p53 becomes stabilized and activated in hESCs, it increases p21 to establish a longer G1. This typically leads to abolition of S-phase entry, which stops the cell cycle in G1, leading to differentiation. Work in mouse embryonic stem cells has recently shown however that the expression of P53 does not necessarily lead to differentiation.[31] p53 also activates miR-34a and miR-145, which then repress the hESCs pluripotency factors, further instigating differentiation.[29]
+
+In adult stem cells, p53 regulation is important for maintenance of stemness in adult stem cell niches. Mechanical signals such as hypoxia affect levels of p53 in these niche cells through the hypoxia inducible factors, HIF-1α and HIF-2α. While HIF-1α stabilizes p53, HIF-2α suppresses it.[32] Suppression of p53 plays important roles in cancer stem cell phenotype, induced pluripotent stem cells and other stem cell roles and behaviors, such as blastema formation. Cells with decreased levels of p53 have been shown to reprogram into stem cells with a much greater efficiency than normal cells.[33][34] Papers suggest that the lack of cell cycle arrest and apoptosis gives more cells the chance to be reprogrammed. Decreased levels of p53 were also shown to be a crucial aspect of blastema formation in the legs of salamanders.[35] p53 regulation is very important in acting as a barrier between stem cells and a differentiated stem cell state, as well as a barrier between stem cells being functional and being cancerous.[36]
+
+Specifically, the term Galilean invariance today usually refers to this principle as applied to Newtonian mechanics, that is, Newton's laws of motion hold in all frames related to one another by a Galilean transformation. In other words, all frames related to one another by such a transformation are inertial (meaning, Newton's equation of motion is valid in these frames). In this context it is sometimes called Newtonian relativity.
+
+Among the axioms from Newton's theory are:
+
+    There exists an absolute space, in which Newton's laws are true. An inertial frame is a reference frame in relative uniform motion to absolute space.
+    All inertial frames share a universal time.
+
+Galilean relativity can be shown as follows. Consider two inertial frames S and S' . A physical event in S will have position coordinates r = (x, y, z) and time t in S, and r' = (x' , y' , z' ) and time t' in S' . By the second axiom above, one can synchronize the clock in the two frames and assume t = t' . Suppose S' is in relative uniform motion to S with velocity v. Consider a point object whose position is given by functions r' (t) in S' and r(t) in S. We see that
+
+    r ′ ( t ) = r ( t ) − v t . r'(t) = r(t) - v t.\,
+
+The velocity of the particle is given by the time derivative of the position:
+
+    u ′ ( t ) = d d t r ′ ( t ) = d d t r ( t ) − v = u ( t ) − v . u'(t) = \frac{d}{d t} r'(t) = \frac{d}{d t} r(t) - v = u(t) - v.
+
+Another differentiation gives the acceleration in the two frames:
+
+    a ′ ( t ) = d d t u ′ ( t ) = d d t u ( t ) − 0 = a ( t ) . a'(t) = \frac{d}{d t} u'(t) = \frac{d}{d t} u(t) - 0 = a(t).
+
+It is this simple but crucial result that implies Galilean relativity. Assuming that mass is invariant in all inertial frames, the above equation shows Newton's laws of mechanics, if valid in one frame, must hold for all frames.[1] But it is assumed to hold in absolute space, therefore Galilean relativity holds.
+
+Newton's theory versus special relativity
+
+A comparison can be made between Newtonian relativity and special relativity.
+
+Some of the assumptions and properties of Newton's theory are:
+
+    The existence of infinitely many inertial frames. Each frame is of infinite size (the entire universe may be covered by many linearly equivalent frames). Any two frames may be in relative uniform motion. (The relativistic nature of mechanics derived above shows that the absolute space assumption is not necessary.)
+    The inertial frames may move in all possible relative forms of uniform motion.
+    There is a universal, or absolute, notion of elapsed time.
+    Two inertial frames are related by a Galilean transformation.
+    In all inertial frames, Newton's laws, and gravity, hold.
+
+In comparison, the corresponding statements from special relativity are as follows:
+
+    The existence, as well, of infinitely many non-inertial frames, each of which referenced to (and physically determined by) a unique set of spacetime coordinates. Each frame may be of infinite size, but its definition is always determined locally by contextual physical conditions. Any two frames may be in relative non-uniform motion (as long as it is assumed that this condition of relative motion implies a relativistic dynamical effect – and later, mechanical effect in general relativity – between both frames).
+    Rather than freely allowing all conditions of relative uniform motion between frames of reference, the relative velocity between two inertial frames becomes bounded above by the speed of light.
+    Instead of universal elapsed time, each inertial frame possesses its own notion of elapsed time.
+    The Galilean transformations are replaced by Lorentz transformations.
+    In all inertial frames, all laws of physics are the same.
+
+Both theories assume the existence of inertial frames. In practice, the size of the frames in which they remain valid differ greatly, depending on gravitational tidal forces.
+
+In the appropriate context, a local Newtonian inertial frame, where Newton's theory remains a good model, extends to roughly 107 light years.
+
+In special relativity, one considers Einstein's cabins, cabins that fall freely in a gravitational field. According to Einstein's thought experiment, a man in such a cabin experiences (to a good approximation) no gravity and therefore the cabin is an approximate inertial frame. However, one has to assume that the size of the cabin is sufficiently small so that the gravitational field is approximately parallel in its interior. This can greatly reduce the sizes of such approximate frames, in comparison to Newtonian frames. For example, an artificial satellite orbiting the Earth can be viewed as a cabin. However, reasonably sensitive instruments could detect "microgravity" in such a situation because the "lines of force" of the Earth's gravitational field converge.
+
+In general, the convergence of gravitational fields in the universe dictates the scale at which one might consider such (local) inertial frames. For example, a spaceship falling into a black hole or neutron star would (at a certain distance) be subjected to tidal forces strong enough to crush it in width and tear it apart in length.[2] In comparison, however, such forces might only be uncomfortable for the astronauts inside (compressing their joints, making it difficult to extend their limbs in any direction perpendicular to the gravity field of the star). Reducing the scale further, the forces at that distance might have almost no effects at all on a mouse. This illustrates the idea that all freely falling frames are locally inertial (acceleration and gravity-free) if the scale is chosen correctly.[2]
+
+Poetry (a term derived from the Greek word poiesis, "making"), also called verse,[note 1] is a form of literature that uses aesthetic and often rhythmic[1][2][3] qualities of language − such as phonaesthetics, sound symbolism, and metre − to evoke meanings in addition to, or in place of, a prosaic ostensible meaning. A poem is a literary composition, written by a poet, using this principle.
+
+Poetry has a long and varied history, evolving differentially across the globe. It dates back at least to prehistoric times with hunting poetry in Africa and to panegyric and elegiac court poetry of the empires of the Nile, Niger, and Volta River valleys.[4] Some of the earliest written poetry in Africa occurs among the Pyramid Texts written during the 25th century BCE. The earliest surviving Western Asian epic poem, the Epic of Gilgamesh, was written in the Sumerian language.
+
+Early poems in the Eurasian continent evolved from folk songs such as the Chinese Shijing as well as from religious hymns (the Sanskrit Rigveda, the Zoroastrian Gathas, the Hurrian songs, and the Hebrew Psalms); or from a need to retell oral epics, as with the Egyptian Story of Sinuhe, Indian epic poetry, and the Homeric epics, the Iliad and the Odyssey.
+
+Ancient Greek attempts to define poetry, such as Aristotle's Poetics, focused on the uses of speech in rhetoric, drama, song, and comedy. Later attempts concentrated on features such as repetition, verse form, and rhyme, and emphasized the aesthetics which distinguish poetry from more objectively-informative prosaic writing.
+
+Poetry uses forms and conventions to suggest differential interpretations of words, or to evoke emotive responses. Devices such as assonance, alliteration, onomatopoeia, and rhythm may convey musical or incantatory effects. The use of ambiguity, symbolism, irony, and other stylistic elements of poetic diction often leaves a poem open to multiple interpretations. Similarly, figures of speech such as metaphor, simile, and metonymy[5] establish a resonance between otherwise disparate images—a layering of meanings, forming connections previously not perceived. Kindred forms of resonance may exist, between individual verses, in their patterns of rhyme or rhythm.
+
+Some poetry types are unique to particular cultures and genres and respond to characteristics of the language in which the poet writes. Readers accustomed to identifying poetry with Dante, Goethe, Mickiewicz, or Rumi may think of it as written in lines based on rhyme and regular meter. There are, however, traditions, such as Biblical poetry, that use other means to create rhythm and euphony. Much modern poetry reflects a critique of poetic tradition,[6] testing the principle of euphony itself or altogether forgoing rhyme or set rhythm.[7][8]
+
+Poets – as, from the Greek, "makers" of language – have contributed to the evolution of the linguistic, expressive, and utilitarian qualities of their languages. In an increasingly globalized world, poets often adapt forms, styles, and techniques from diverse cultures and languages.
+
+A Western cultural tradition (extending at least from Homer to Rilke) associates the production of poetry with inspiration – often by a Muse (either classical or contemporary), or through other (often canonised) poets' work which sets some kind of example or challenge.
+
+In first-person poems, the lyrics are spoken by an "I", a character who may be termed the speaker, distinct from the poet (the author). Thus if, for example, a poem asserts, "I killed my enemy in Reno", it is the speaker, not the poet, who is the killer (unless this "confession" is a form of metaphor which needs to be considered in closer context – via close reading).
+
+Early works
+
+Some scholars believe that the art of poetry may predate literacy, and developed from folk epics and other oral genres.[9][10] Others, however, suggest that poetry did not necessarily predate writing.[11]
+
+The oldest surviving epic poem, the Epic of Gilgamesh, dates from the 3rd millennium BCE in Sumer (in Mesopotamia, present-day Iraq), and was written in cuneiform script on clay tablets and, later, on papyrus.[12] The Istanbul tablet#2461, dating to c. 2000 BCE, describes an annual rite in which the king symbolically married and mated with the goddess Inanna to ensure fertility and prosperity; some have labelled it the world's oldest love poem.[13][14] An example of Egyptian epic poetry is The Story of Sinuhe (c. 1800 BCE).[15]
+
+Other ancient epics includes the Greek Iliad and the Odyssey; the Persian Avestan books (the Yasna); the Roman national epic, Virgil's Aeneid (written between 29 and 19 BCE); and the Indian epics, the Ramayana and the Mahabharata. Epic poetry appears to have been composed in poetic form as an aid to memorization and oral transmission in ancient societies.[11][16]
+
+Other forms of poetry, including such ancient collections of religious hymns as the Indian Sanskrit-language Rigveda, the Avestan Gathas, the Hurrian songs, and the Hebrew Psalms, possibly developed directly from folk songs. The earliest entries in the oldest extant collection of Chinese poetry, the Classic of Poetry (Shijing), were initially lyrics.[17] The Shijing, with its collection of poems and folk songs, was heavily valued by the philosopher Confucius and is considered to be one of the official Confucian classics. His remarks on the subject have become an invaluable source in ancient music theory.[18]
+
+The efforts of ancient thinkers to determine what makes poetry distinctive as a form, and what distinguishes good poetry from bad, resulted in "poetics"—the study of the aesthetics of poetry.[19] Some ancient societies, such as China's through the Shijing, developed canons of poetic works that had ritual as well as aesthetic importance.[20] More recently, thinkers have struggled to find a definition that could encompass formal differences as great as those between Chaucer's Canterbury Tales and Matsuo Bashō's Oku no Hosomichi, as well as differences in content spanning Tanakh religious poetry, love poetry, and rap.[21]
+
+Until recently, the earliest examples of stressed poetry had been thought to be works composed by Romanos the Melodist (fl. 6th century CE). However, Tim Whitmarsh writes that an inscribed Greek poem predated Romanos' stressed poetry. [22][23][24]
+
+"Figure 32.—Julius obtaining banana by using pole to climb up on and spring from. Figure 33.—Using pole to swing out on so that banana could be grasped. Figure 34.—Using stick to draw carrot within reach." From The mental life of monkeys and apes; a study of ideational behavior, by Robert Mearns Yerkes, 1916
+
+The monkey and banana problem is a famous toy problem in artificial intelligence, particularly in logic programming and planning.
+Formulation of the problem
+
+A monkey is in a room. Suspended from the ceiling is a bunch of bananas, beyond the monkey's reach. However, in the room there are also a chair and a stick. The ceiling is just the right height so that a monkey standing on a chair could knock the bananas down with the stick. The monkey knows how to move around, carry other things around, reach for the bananas, and wave a stick in the air. What is the best sequence of actions for the monkey?
+Purpose of the problem
+
+The problem seeks to answer the question of whether monkeys are intelligent. Both humans and monkeys have the ability to use mental maps to remember things like where to go to find shelter, or how to avoid danger. They can also remember where to go to gather food and water, as well as how to communicate with each other. Monkeys have the ability not only to remember how to hunt and gather but to learn new things, as is the case with the monkey and the bananas: despite the fact that the monkey may never have been in an identical situation, with the same artifacts at hand, a monkey is capable of concluding that it needs to make a ladder, position it below the bananas, and climb up to reach for them.
+
+The degree to which such abilities should be ascribed to instinct or learning is a matter of debate.
+
+In 1984, a pigeon was observed as having the capacity to solve a problem.[1][2]
+Software solutions
+
+The problem is used as a toy problem for computer science. It can be solved with an expert system such as CLIPS. The example set of rules that CLIPS provides is somewhat fragile in that naive changes to the rulebase that might seem to a human of average intelligence to make common sense can cause the engine to fail to get the monkey to reach the banana.[3]
+
+Other examples exist using Rules Based System (RBS) a project implemented in Python.[4][5]
+
+VBScript ("Microsoft Visual Basic Scripting Edition") is a deprecated Active Scripting language developed by Microsoft that is modeled on Visual Basic. It allows Microsoft Windows system administrators to generate powerful tools for managing computers without error handling and with subroutines and other advanced programming constructs. It can give the user complete control over many aspects of their computing environment.
+
+VBScript uses the Component Object Model to access elements of the environment within which it is running; for example, the FileSystemObject (FSO) is used to create, read, update and delete files. VBScript has been installed by default in every desktop release of Microsoft Windows since Windows 98;[1] in Windows Server since Windows NT 4.0 Option Pack;[2] and optionally with Windows CE (depending on the device it is installed on).
+
+A VBScript script must be executed within a host environment, of which there are several provided with Microsoft Windows, including: Windows Script Host (WSH), Internet Explorer (IE), and Internet Information Services (IIS).[3] Additionally, the VBScript hosting environment is embeddable in other programs, through technologies such as the Microsoft Script Control (msscript.ocx).
+History
+
+VBScript began as part of the Microsoft Windows Script Technologies, launched in 1996. This technology (which also included JScript) was initially targeted at web developers. During a period of just over two years, VBScript advanced from version 1.0 to 2.0, and over that time it gained support from Windows system administrators seeking an automation tool more powerful than the batch language first developed in the early 1980s.[4] On August 1, 1996, Internet Explorer was released with features that included VBScript.[5]
+
+In version 5.0, the functionality of VBScript was increased with new features including regular expressions; classes; the With statement;[6] the Eval, Execute, and ExecuteGlobal functions to evaluate and execute script commands built during the execution of another script; a function-pointer system via GetRef,[7] and Distributed COM (DCOM) support.
+
+In version 5.5, SubMatches[8] were added to the regular expression class in VBScript, to finally allow script authors to capture the text within the expression's groups. That capability had already been available in JScript.
+
+With the advent of the .NET Framework, the scripting team took the decision to implement future support for VBScript within ASP.NET for web development,[9] and therefore no new versions of the VBScript engine would be developed. It would henceforth be supported by Microsoft's Sustaining Engineering Team, who are responsible for bug fixes and security enhancements. For Windows system administrators, Microsoft suggests migrating to Windows PowerShell, as VBScript is deprecated and will eventually be removed from Windows.
+
+On October 9, 2023, Microsoft announced plans to deprecate and eventually remove VBScript from future Windows versions.[10]
+Environments
+
+When employed for client-side web development in Microsoft Internet Explorer, VBScript is similar in function to JavaScript. It is used to write executable functions that are embedded in or included from HTML pages and interact with the Document Object Model (DOM) of the page, to perform tasks not possible in HTML alone. However, other web browsers such as Firefox and Opera and recently Chrome do not have built-in support for VBScript. This means that where client-side scripting and cross-browser compatibility are required, developers usually choose JavaScript over most other programming languages, such as VBScript.
+
+VBScript is also used for server-side processing of web pages, most notably with Microsoft Active Server Pages (ASP). The ASP engine and type library, asp.dll, invokes vbscript.dll to run VBScript scripts. VBScript that is embedded in an ASP page is contained within <% and %> context switches. The following example of an ASP page with VBScript displays the current time in 24-hour format, but it is also used as the default script for error messages with the Windows operating systems.
+
+FAISS, including its GPU version (faiss-gpu), primarily works with NumPy arrays. This means that even when using FAISS with GPU support, the input data typically needs to be in the form of a NumPy array. However, this doesn't mean that you can't use PyTorch tensors with FAISS; it just means that you'll need to convert PyTorch tensors to NumPy arrays before using them with FAISS functions.
+
+The reason for this requirement is that FAISS internally uses NumPy arrays for its data structures and computations, and although PyTorch tensors and NumPy arrays are often interoperable, they are not the same thing.
+
+Here's what you typically need to do when working with PyTorch tensors and FAISS:
+
+    Convert PyTorch Tensors to NumPy Arrays:
+        If your data is in a PyTorch tensor, you can convert it to a NumPy array using the .numpy() method if the tensor is on the CPU.
+        If the tensor is on the GPU, you should first move it to the CPU with .cpu() and then convert it to a NumPy array.
+
+    Use FAISS for Operations:
+        Perform your necessary operations in FAISS using the NumPy arrays.
+
+    Convert Back to PyTorch Tensors if Needed:
+        If you need to bring the results back into PyTorch, you can convert the NumPy arrays back to PyTorch tensors.
+
+This conversion process is typically fast and efficient, but it's important to be aware of it, especially when working with large datasets or in situations where memory usage and computational efficiency are critical concerns.
+
+In summary, while faiss-gpu does not directly operate on PyTorch tensors, the conversion between PyTorch tensors and NumPy arrays is straightforward and usually doesn't pose a significant bottleneck in terms of performance.
+
+Resetting the chassis intrusion status on a Mellanox switch using the I2C (Inter-Integrated Circuit) interface involves a series of specific steps. Please note that this process is technical and should ideally be performed by someone with experience in managing network equipment and familiarity with the I2C protocol. Here's a general outline of how you might approach this task:
+
+    Access the I2C Interface: To interact with the I2C interface, you'll need to use a system that can communicate over I2C, such as a microcontroller or a computer with an I2C adapter. The exact method of accessing the I2C interface will depend on the equipment you are using.
+
+    Identify the Correct I2C Address: Each device on the I2C bus has a unique address. You need to find out the address of the chassis intrusion sensor or the relevant control register on the Mellanox switch. This information should be available in the switch's hardware documentation.
+
+    Read the Current Intrusion Status: Before resetting the intrusion status, it's often a good practice to read the current status. This can be done by sending a read command to the appropriate address and register.
+
+    Send a Reset Command: To reset the chassis intrusion status, you will need to write to a specific register of the intrusion detection system. This typically involves sending a particular byte or set of bytes to a specific address. The exact command will depend on the Mellanox switch model and its firmware.
+
+    Verify the Reset: After sending the reset command, you should read back the intrusion status to ensure that it has been successfully reset.
+
+    Log the Action: For security and auditing purposes, it's advisable to log this action in your system's records.
+
+It is crucial to consult the specific documentation for your Mellanox switch model for detailed instructions and the correct register addresses. Improperly interacting with the I2C interface can cause unintended behavior or damage to the switch.
+
+If you are not familiar with this level of hardware interaction, it is recommended to seek assistance from a qualified network engineer or contact Mellanox's technical support for guidance.
+
+Philosophy of mind is a branch of philosophy that studies the ontology and nature of the mind and its relationship with the body. The mind–body problem is a paradigmatic issue in philosophy of mind, although a number of other issues are addressed, such as the hard problem of consciousness and the nature of particular mental states.[1][2][3] Aspects of the mind that are studied include mental events, mental functions, mental properties, consciousness and its neural correlates, the ontology of the mind, the nature of cognition and of thought, and the relationship of the mind to the body.
+
+Dualism and monism are the two central schools of thought on the mind–body problem, although nuanced views have arisen that do not fit one or the other category neatly.
+
+    Dualism finds its entry into Western philosophy thanks to René Descartes in the 17th century.[4] Substance dualists like Descartes argue that the mind is an independently existing substance, whereas property dualists maintain that the mind is a group of independent properties that emerge from and cannot be reduced to the brain, but that it is not a distinct substance.[5]
+    Monism is the position that mind and body are ontologically indiscernible entities, not dependent substances. This view was espoused by the 17th-century rationalist Baruch Spinoza.[6] Physicalists argue that only entities postulated by physical theory exist, and that mental processes will eventually be explained in terms of these entities as physical theory continues to evolve. Physicalists maintain various positions on the prospects of reducing mental properties to physical properties (many of whom adopt compatible forms of property dualism),[7][8][9][10][11][12] and the ontological status of such mental properties remains unclear.[11][13][14] Idealists maintain that the mind is all that exists and that the external world is either mental itself, or an illusion created by the mind. Neutral monists such as Ernst Mach and William James argue that events in the world can be thought of as either mental (psychological) or physical depending on the network of relationships into which they enter, and dual-aspect monists such as Spinoza adhere to the position that there is some other, neutral substance, and that both matter and mind are properties of this unknown substance. The most common monisms in the 20th and 21st centuries have all been variations of physicalism; these positions include behaviorism, the type identity theory, anomalous monism and functionalism.[15]
+
+Most modern philosophers of mind adopt either a reductive physicalist or non-reductive physicalist position, maintaining in their different ways that the mind is not something separate from the body.[15] These approaches have been particularly influential in the sciences, especially in the fields of sociobiology, computer science (specifically, artificial intelligence), evolutionary psychology and the various neurosciences.[16][17][18][19] Reductive physicalists assert that all mental states and properties will eventually be explained by scientific accounts of physiological processes and states.[20][21][22] Non-reductive physicalists argue that although the mind is not a separate substance, mental properties supervene on physical properties, or that the predicates and vocabulary used in mental descriptions and explanations are indispensable, and cannot be reduced to the language and lower-level explanations of physical science.[23][24] Continued neuroscientific progress has helped to clarify some of these issues; however, they are far from being resolved. Modern philosophers of mind continue to ask how the subjective qualities and the intentionality of mental states and properties can be explained in naturalistic terms.[25][26]
+
+However, a number of issues have been recognized with non-reductive physicalism. First, it is irreconcilable with self-identity over time. Secondly, intentional states of consciousness do not make sense on non-reductive physicalism. Thirdly, free will is impossible to reconcile with either reductive or non-reductive physicalism. Fourthly, it fails to properly explain the phenomenon of mental causation.[27]
+Mind–body problem
+Main article: Mind–body problem
+René Descartes' illustration of mind/body dualism
+
+The mind–body problem concerns the explanation of the relationship that exists between minds, or mental processes, and bodily states or processes.[1] The main aim of philosophers working in this area is to determine the nature of the mind and mental states/processes, and how—or even if—minds are affected by and can affect the body.
+
+Perceptual experiences depend on stimuli that arrive at our various sensory organs from the external world, and these stimuli cause changes in our mental states, ultimately causing us to feel a sensation, which may be pleasant or unpleasant. Someone's desire for a slice of pizza, for example, will tend to cause that person to move his or her body in a specific manner and in a specific direction to obtain what he or she wants. The question, then, is how it can be possible for conscious experiences to arise out of a lump of gray matter endowed with nothing but electrochemical properties.[15]
+
+A related problem is how someone's propositional attitudes (e.g. beliefs and desires) cause that individual's neurons to fire and muscles to contract. These comprise some of the puzzles that have confronted epistemologists and philosophers of mind from the time of René Descartes.[4]
+Dualist solutions to the mind–body problem
+See also: Mind in eastern philosophy
+
+Dualism is a set of views about the relationship between mind and matter (or body). It begins with the claim that mental phenomena are, in some respects, non-physical.[5] One of the earliest known formulations of mind–body dualism was expressed in the eastern Samkhya and Yoga schools of Hindu philosophy (c. 650 BCE), which divided the world into purusha (mind/spirit) and prakriti (material substance).[28] Specifically, the Yoga Sutra of Patanjali presents an analytical approach to the nature of the mind.
+
+In Western Philosophy, the earliest discussions of dualist ideas are in the writings of Plato who suggested that humans' intelligence (a faculty of the mind or soul) could not be identified with, or explained in terms of, their physical body.[29][30] However, the best-known version of dualism is due to René Descartes (1641), and holds that the mind is a non-extended, non-physical substance, a "res cogitans".[4] Descartes was the first to clearly identify the mind with consciousness and self-awareness, and to distinguish this from the brain, which was the seat of intelligence. He was therefore the first to formulate the mind–body problem in the form in which it still exists today.[4]
+Arguments for dualism
+
+The most frequently used argument in favor of dualism appeals to the common-sense intuition that conscious experience is distinct from inanimate matter. If asked what the mind is, the average person would usually respond by identifying it with their self, their personality, their soul, or another related entity. They would almost certainly deny that the mind simply is the brain, or vice versa, finding the idea that there is just one ontological entity at play to be too mechanistic or unintelligible.[5] Modern philosophers of mind think that these intuitions are misleading, and that critical faculties, along with empirical evidence from the sciences, should be used to examine these assumptions and determine whether there is any real basis to them.[5]
+
+The mental and the physical seem to have quite different, and perhaps irreconcilable, properties.[31] Mental events have a subjective quality, whereas physical events do not. So, for example, one can reasonably ask what a burnt finger feels like, or what a blue sky looks like, or what nice music sounds like to a person. But it is meaningless, or at least odd, to ask what a surge in the uptake of glutamate in the dorsolateral portion of the prefrontal cortex feels like.
+
+Philosophers of mind call the subjective aspects of mental events "qualia" or "raw feels".[31] There are qualia involved in these mental events that seem particularly difficult to reduce to anything physical. David Chalmers explains this argument by stating that we could conceivably know all the objective information about something, such as the brain states and wavelengths of light involved with seeing the color red, but still not know something fundamental about the situation – what it is like to see the color red.[32]
+
+If consciousness (the mind) can exist independently of physical reality (the brain), one must explain how physical memories are created concerning consciousness. Dualism must therefore explain how consciousness affects physical reality. One possible explanation is that of a miracle, proposed by Arnold Geulincx and Nicolas Malebranche, where all mind–body interactions require the direct intervention of God.
+
+Another argument that has been proposed by C. S. Lewis[33] is the Argument from Reason: if, as monism implies, all of our thoughts are the effects of physical causes, then we have no reason for assuming that they are also the consequent of a reasonable ground. Knowledge, however, is apprehended by reasoning from ground to consequent. Therefore, if monism is correct, there would be no way of knowing this—or anything else—we could not even suppose it, except by a fluke.
+
+The zombie argument is based on a thought experiment proposed by Todd Moody, and developed by David Chalmers in his book The Conscious Mind. The basic idea is that one can imagine one's body, and therefore conceive the existence of one's body, without any conscious states being associated with this body. Chalmers' argument is that it seems possible that such a being could exist because all that is needed is that all and only the things that the physical sciences describe about a zombie must be true of it. Since none of the concepts involved in these sciences make reference to consciousness or other mental phenomena, and any physical entity can be by definition described scientifically via physics, the move from conceivability to possibility is not such a large one.[34] Others such as Dennett have argued that the notion of a philosophical zombie is an incoherent,[35] or unlikely,[36] concept. It has been argued under physicalism that one must either believe that anyone including oneself might be a zombie, or that no one can be a zombie—following from the assertion that one's own conviction about being (or not being) a zombie is a product of the physical world and is therefore no different from anyone else's. This argument has been expressed by Dennett who argues that "Zombies think they are conscious, think they have qualia, think they suffer pains—they are just 'wrong' (according to this lamentable tradition) in ways that neither they nor we could ever discover!"[35] See also the problem of other minds.
+Interactionist dualism
+Portrait of René Descartes by Frans Hals (1648)
+
+Interactionist dualism, or simply interactionism, is the particular form of dualism first espoused by Descartes in the Meditations.[4] In the 20th century, its major defenders have been Karl Popper and John Carew Eccles.[37] It is the view that mental states, such as beliefs and desires, causally interact with physical states.[5]
+
+Descartes's argument for this position can be summarized as follows: Seth has a clear and distinct idea of his mind as a thinking thing that has no spatial extension (i.e., it cannot be measured in terms of length, weight, height, and so on). He also has a clear and distinct idea of his body as something that is spatially extended, subject to quantification and not able to think. It follows that mind and body are not identical because they have radically different properties.[4]
+
+Seth's mental states (desires, beliefs, etc.) have causal effects on his body and vice versa: A child touches a hot stove (physical event) which causes pain (mental event) and makes her yell (physical event), this in turn provokes a sense of fear and protectiveness in the caregiver (mental event), and so on.
+
+Descartes' argument depends on the premise that what Seth believes to be "clear and distinct" ideas in his mind are necessarily true. Many contemporary philosophers doubt this.[38][39][40] For example, Joseph Agassi suggests that several scientific discoveries made since the early 20th century have undermined the idea of privileged access to one's own ideas. Freud claimed that a psychologically-trained observer can understand a person's unconscious motivations better than the person himself does. Duhem has shown that a philosopher of science can know a person's methods of discovery better than that person herself does, while Malinowski has shown that an anthropologist can know a person's customs and habits better than the person whose customs and habits they are. He also asserts that modern psychological experiments that cause people to see things that are not there provide grounds for rejecting Descartes' argument, because scientists can describe a person's perceptions better than the person herself can.[41][42]
+Other forms of dualism
+Four varieties of dualism. The arrows indicate the direction of the causal interactions. Occasionalism is not shown.
+Psychophysical parallelism
+
+Psychophysical parallelism, or simply parallelism, is the view that mind and body, while having distinct ontological statuses, do not causally influence one another. Instead, they run along parallel paths (mind events causally interact with mind events and brain events causally interact with brain events) and only seem to influence each other.[43] This view was most prominently defended by Gottfried Leibniz. Although Leibniz was an ontological monist who believed that only one type of substance, the monad, exists in the universe, and that everything is reducible to it, he nonetheless maintained that there was an important distinction between "the mental" and "the physical" in terms of causation. He held that God had arranged things in advance so that minds and bodies would be in harmony with each other. This is known as the doctrine of pre-established harmony.[44]
+Occasionalism
+
+Occasionalism is the view espoused by Nicholas Malebranche as well as Islamic philosophers such as Abu Hamid Muhammad ibn Muhammad al-Ghazali that asserts all supposedly causal relations between physical events, or between physical and mental events, are not really causal at all. While body and mind are different substances, causes (whether mental or physical) are related to their effects by an act of God's intervention on each specific occasion.[45]
+Property dualism
+
+Property dualism is the view that the world is constituted of one kind of substance – the physical kind – and there exist two distinct kinds of properties: physical properties and mental properties. It is the view that non-physical, mental properties (such as beliefs, desires and emotions) inhere in some physical bodies (at least, brains). Sub-varieties of property dualism include:
+
+    Emergent materialism asserts that when matter is organized in the appropriate way (i.e. in the way that living human bodies are organized), mental properties emerge in a way not fully accountable for by physical laws.[5] These emergent properties have an independent ontological status and cannot be reduced to, or explained in terms of, the physical substrate from which they emerge. They are dependent on the physical properties from which they emerge, but opinions vary as to the coherence of top–down causation, i.e. the causal effectiveness of such properties. A form of emergent materialism has been espoused by David Chalmers and the concept has undergone something of a renaissance in recent years,[46] but it was already suggested in the 19th century by William James.
+    Epiphenomenalism is a doctrine first formulated by Thomas Henry Huxley.[47] It consists of the view that mental phenomena are causally ineffectual, where one or more mental states do not have any influence on physical states or mental phenomena are the effects, but not the causes, of physical phenomena. Physical events can cause other physical and mental events, but mental events cannot cause anything since they are just causally inert by-products (i.e. epiphenomena) of the physical world.[43] This view has been defended by Frank Jackson.[48]
+    Non-reductive physicalism is the view that mental properties form a separate ontological class to physical properties: mental states (such as qualia) are not reducible to physical states. The ontological stance towards qualia in the case of non-reductive physicalism does not imply that qualia are causally inert; this is what distinguishes it from epiphenomenalism.
+    Panpsychism is the view that all matter has a mental aspect, or, alternatively, all objects have a unified center of experience or point of view. Superficially, it seems to be a form of property dualism, since it regards everything as having both mental and physical properties. However, some panpsychists say that mechanical behaviour is derived from the primitive mentality of atoms and molecules—as are sophisticated mentality and organic behaviour, the difference being attributed to the presence or absence of complex structure in a compound object. So long as the reduction of non-mental properties to mental ones is in place, panpsychism is not a (strong) form of property dualism; otherwise it is.
+
+Dual aspect theory
+
+Dual aspect theory or dual-aspect monism is the view that the mental and the physical are two aspects of, or perspectives on, the same substance. (Thus it is a mixed position, which is monistic in some respects). In modern philosophical writings, the theory's relationship to neutral monism has become somewhat ill-defined, but one proffered distinction says that whereas neutral monism allows the context of a given group of neutral elements and the relationships into which they enter to determine whether the group can be thought of as mental, physical, both, or neither, dual-aspect theory suggests that the mental and the physical are manifestations (or aspects) of some underlying substance, entity or process that is itself neither mental nor physical as normally understood. Various formulations of dual-aspect monism also require the mental and the physical to be complementary, mutually irreducible and perhaps inseparable (though distinct).[49][50][51]
+Experiential dualism
+
+This is a philosophy of mind that regards the degrees of freedom between mental and physical well-being as not synonymous thus implying an experiential dualism between body and mind. An example of these disparate degrees of freedom is given by Allan Wallace who notes that it is "experientially apparent that one may be physically uncomfortable—for instance, while engaging in a strenuous physical workout—while mentally cheerful; conversely, one may be mentally distraught while experiencing physical comfort".[52] Experiential dualism notes that our subjective experience of merely seeing something in the physical world seems qualitatively different from mental processes like grief that comes from losing a loved one. This philosophy is a proponent of causal dualism, which is defined as the dual ability for mental states and physical states to affect one another. Mental states can cause changes in physical states and vice versa.
+
+However, unlike cartesian dualism or some other systems, experiential dualism does not posit two fundamental substances in reality: mind and matter. Rather, experiential dualism is to be understood as a conceptual framework that gives credence to the qualitative difference between the experience of mental and physical states. Experiential dualism is accepted as the conceptual framework of Madhyamaka Buddhism.
+
+Madhayamaka Buddhism goes further, finding fault with the monist view of physicalist philosophies of mind as well in that these generally posit matter and energy as the fundamental substance of reality. Nonetheless, this does not imply that the cartesian dualist view is correct, rather Madhyamaka regards as error any affirming view of a fundamental substance to reality.
+
+    In denying the independent self-existence of all the phenomena that make up the world of our experience, the Madhyamaka view departs from both the substance dualism of Descartes and the substance monism—namely, physicalism—that is characteristic of modern science. The physicalism propounded by many contemporary scientists seems to assert that the real world is composed of physical things-in-themselves, while all mental phenomena are regarded as mere appearances, devoid of any reality in and of themselves. Much is made of this difference between appearances and reality.[52]
+
+Indeed, physicalism, or the idea that matter is the only fundamental substance of reality, is explicitly rejected by Buddhism.
+
+    In the Madhyamaka view, mental events are no more or less real than physical events. In terms of our common-sense experience, differences of kind do exist between physical and mental phenomena. While the former commonly have mass, location, velocity, shape, size, and numerous other physical attributes, these are not generally characteristic of mental phenomena. For example, we do not commonly conceive of the feeling of affection for another person as having mass or location. These physical attributes are no more appropriate to other mental events such as sadness, a recalled image from one's childhood, the visual perception of a rose, or consciousness of any sort. Mental phenomena are, therefore, not regarded as being physical, for the simple reason that they lack many of the attributes that are uniquely characteristic of physical phenomena. Thus, Buddhism has never adopted the physicalist principle that regards only physical things as real.[52]
+
+Monist solutions to the mind–body problem
+
+In contrast to dualism, monism does not accept any fundamental divisions. The fundamentally disparate nature of reality has been central to forms of eastern philosophies for over two millennia. In Indian and Chinese philosophy, monism is integral to how experience is understood. Today, the most common forms of monism in Western philosophy are physicalist.[15] Physicalistic monism asserts that the only existing substance is physical, in some sense of that term to be clarified by our best science.[53] However, a variety of formulations (see below) are possible. Another form of monism, idealism, states that the only existing substance is mental. Although pure idealism, such as that of George Berkeley, is uncommon in contemporary Western philosophy, a more sophisticated variant called panpsychism, according to which mental experience and properties may be at the foundation of physical experience and properties, has been espoused by some philosophers such as Alfred North Whitehead[54] and David Ray Griffin.[46]
+
+Phenomenalism is the theory that representations (or sense data) of external objects are all that exist. Such a view was briefly adopted by Bertrand Russell and many of the logical positivists during the early 20th century.[55] A third possibility is to accept the existence of a basic substance that is neither physical nor mental. The mental and physical would then both be properties of this neutral substance. Such a position was adopted by Baruch Spinoza[6] and was popularized by Ernst Mach[56] in the 19th century. This neutral monism, as it is called, resembles property dualism.
+Physicalistic monisms
+Behaviorism
+Main article: Behaviorism
+
+Behaviorism dominated philosophy of mind for much of the 20th century, especially the first half.[15] In psychology, behaviorism developed as a reaction to the inadequacies of introspectionism.[53] Introspective reports on one's own interior mental life are not subject to careful examination for accuracy and cannot be used to form predictive generalizations. Without generalizability and the possibility of third-person examination, the behaviorists argued, psychology cannot be scientific.[53] The way out, therefore, was to eliminate the idea of an interior mental life (and hence an ontologically independent mind) altogether and focus instead on the description of observable behavior.[57]
+
+Parallel to these developments in psychology, a philosophical behaviorism (sometimes called logical behaviorism) was developed.[53] This is characterized by a strong verificationism, which generally considers unverifiable statements about interior mental life pointless. For the behaviorist, mental states are not interior states on which one can make introspective reports. They are just descriptions of behavior or dispositions to behave in certain ways, made by third parties to explain and predict another's behavior.[58]
+
+Philosophical behaviorism has fallen out of favor since the latter half of the 20th century, coinciding with the rise of cognitivism.[1]
+Identity theory
+Main article: Type physicalism
+
+Type physicalism (or type-identity theory) was developed by Jack Smart[22] and Ullin Place[59] as a direct reaction to the failure of behaviorism. These philosophers reasoned that, if mental states are something material, but not behavioral, then mental states are probably identical to internal states of the brain. In very simplified terms: a mental state M is nothing other than brain state B. The mental state "desire for a cup of coffee" would thus be nothing more than the "firing of certain neurons in certain brain regions".[22]
+The classic Identity theory and Anomalous Monism in contrast. For the Identity theory, every token instantiation of a single mental type corresponds (as indicated by the arrows) to a physical token of a single physical type. For anomalous monism, the token–token correspondences can fall outside of the type–type correspondences. The result is token identity.
+
+On the other hand, even granted the above, it does not follow that identity theories of all types must be abandoned. According to token identity theories, the fact that a certain brain state is connected with only one mental state of a person does not have to mean that there is an absolute correlation between types of mental state and types of brain state. The type–token distinction can be illustrated by a simple example: the word "green" contains four types of letters (g, r, e, n) with two tokens (occurrences) of the letter e along with one each of the others. The idea of token identity is that only particular occurrences of mental events are identical with particular occurrences or tokenings of physical events.[60] Anomalous monism (see below) and most other non-reductive physicalisms are token-identity theories.[61] Despite these problems, there is a renewed interest in the type identity theory today, primarily due to the influence of Jaegwon Kim.[22]
+Functionalism
+Main article: Functionalism (philosophy of mind)
+
+Functionalism was formulated by Hilary Putnam and Jerry Fodor as a reaction to the inadequacies of the identity theory.[24] Putnam and Fodor saw mental states in terms of an empirical computational theory of the mind.[62] At about the same time or slightly after, D.M. Armstrong and David Kellogg Lewis formulated a version of functionalism that analyzed the mental concepts of folk psychology in terms of functional roles.[63] Finally, Wittgenstein's idea of meaning as use led to a version of functionalism as a theory of meaning, further developed by Wilfrid Sellars and Gilbert Harman. Another one, psychofunctionalism, is an approach adopted by the naturalistic philosophy of mind associated with Jerry Fodor and Zenon Pylyshyn.
+
+Mental states are characterized by their causal relations with other mental states and with sensory inputs and behavioral outputs. Functionalism abstracts away from the details of the physical implementation of a mental state by characterizing it in terms of non-mental functional properties. For example, a kidney is characterized scientifically by its functional role in filtering blood and maintaining certain chemical balances.[62]
+Non-reductive physicalism
+Main article: Physicalism
+
+Non-reductionist philosophers hold firmly to two essential convictions with regard to mind–body relations: 1) Physicalism is true and mental states must be physical states, but 2) All reductionist proposals are unsatisfactory: mental states cannot be reduced to behavior, brain states or functional states.[53] Hence, the question arises whether there can still be a non-reductive physicalism. Donald Davidson's anomalous monism[23] is an attempt to formulate such a physicalism. He "thinks that when one runs across what are traditionally seen as absurdities of Reason, such as akrasia or self-deception, the personal psychology framework is not to be given up in favor of the subpersonal one, but rather must be enlarged or extended so that the rationality set out by the principle of charity can be found elsewhere."[64]
+
+Davidson uses the thesis of supervenience: mental states supervene on physical states, but are not reducible to them. "Supervenience" therefore describes a functional dependence: there can be no change in the mental without some change in the physical–causal reducibility between the mental and physical without ontological reducibility.[65]
+
+Non-reductive physicalism, however, is irreconcilable with self-identity over time [source?]. The brain goes on from one moment of time to another; the brain thus has identity through time. But its states of awareness do not go on from one moment to the next. There is no enduring self – no “I” (capital-I) that goes on from one moment to the next. An analogy of the self or the “I” would be the flame of a candle. The candle and the wick go on from one moment to the next, but the flame does not go on. There is a different flame at each moment of the candle’s burning. The flame displays a type of continuity in that the candle does not go out while it is burning, but there is not really any identity of the flame from one moment to another over time. The scenario is similar on non-reductive physicalism with states of awareness. Every state of the brain at different times has a different state of awareness related to it, but there is no enduring self or “I” from one moment to the next. Similarly, it is an illusion that one is the same individual who walked into class this morning. In fact, one is not the same individual because there is no personal identity over time. If one does exist and one is the same individual who entered into class this morning, then a non-reductive physicalist view of the self should be dismissed.[27]
+
+Because non-reductive physicalist theories attempt to both retain the ontological distinction between mind and body and try to solve the "surfeit of explanations puzzle" in some way; critics often see this as a paradox and point out the similarities to epiphenomenalism, in that it is the brain that is seen as the root "cause" not the mind, and the mind seems to be rendered inert.
+
+Epiphenomenalism regards one or more mental states as the byproduct of physical brain states, having no influence on physical states. The interaction is one-way (solving the "surfeit of explanations puzzle") but leaving us with non-reducible mental states (as a byproduct of brain states) – causally reducible, but ontologically irreducible to physical states. Pain would be seen by epiphenomenalists as being caused by the brain state but as not having effects on other brain states, though it might have effects on other mental states (i.e. cause distress).
+Weak emergentism
+Main article: Emergentism
+
+Weak emergentism is a form of "non-reductive physicalism" that involves a layered view of nature, with the layers arranged in terms of increasing complexity and each corresponding to its own special science. Some philosophers[who?] hold that emergent properties causally interact with more fundamental levels, while others maintain that higher-order properties simply supervene over lower levels without direct causal interaction. The latter group therefore holds a less strict, or "weaker", definition of emergentism, which can be rigorously stated as follows: a property P of composite object O is emergent if it is metaphysically impossible for another object to lack property P if that object is composed of parts with intrinsic properties identical to those in O and has those parts in an identical configuration.[citation needed]
+
+Sometimes emergentists use the example of water having a new property when Hydrogen H and Oxygen O combine to form H2O (water). In this example there "emerges" a new property of a transparent liquid that would not have been predicted by understanding hydrogen and oxygen as gases. This is analogous to physical properties of the brain giving rise to a mental state. Emergentists try to solve the notorious mind–body gap this way. One problem for emergentism is the idea of causal closure in the world that does not allow for a mind-to-body causation.[66]
+Eliminative materialism
+Main article: Eliminative materialism
+
+If one is a materialist and believes that all aspects of our common-sense psychology will find reduction to a mature cognitive neuroscience, and that non-reductive materialism is mistaken, then one can adopt a final, more radical position: eliminative materialism.
+
+There are several varieties of eliminative materialism, but all maintain that our common-sense "folk psychology" badly misrepresents the nature of some aspect of cognition. Eliminativists such as Patricia and Paul Churchland argue that while folk psychology treats cognition as fundamentally sentence-like, the non-linguistic vector/matrix model of neural network theory or connectionism will prove to be a much more accurate account of how the brain works.[20]
+
+The Churchlands often invoke the fate of other, erroneous popular theories and ontologies that have arisen in the course of history.[20][21] For example, Ptolemaic astronomy served to explain and roughly predict the motions of the planets for centuries, but eventually this model of the solar system was eliminated in favor of the Copernican model. The Churchlands believe the same eliminative fate awaits the "sentence-cruncher" model of the mind in which thought and behavior are the result of manipulating sentence-like states called "propositional attitudes". Sociologist Jacy Reese Anthis argues for eliminative materialism on all faculties of mind, including consciousness, stating, "The deepest mysteries of the mind are within our reach."[67]
+Mysterianism
+Main article: New mysterianism
+
+Some philosophers take an epistemic approach and argue that the mind–body problem is currently unsolvable, and perhaps will always remain unsolvable to human beings. This is usually termed New mysterianism. Colin McGinn holds that human beings are cognitively closed in regards to their own minds. According to McGinn human minds lack the concept-forming procedures to fully grasp how mental properties such as consciousness arise from their causal basis.[68] An example would be how an elephant is cognitively closed in regards to particle physics.
+
+A more moderate conception has been expounded by Thomas Nagel, which holds that the mind–body problem is currently unsolvable at the present stage of scientific development and that it might take a future scientific paradigm shift or revolution to bridge the explanatory gap. Nagel posits that in the future a sort of "objective phenomenology" might be able to bridge the gap between subjective conscious experience and its physical basis.[69]
+Linguistic criticism of the mind–body problem
+
+Each attempt to answer the mind–body problem encounters substantial problems. Some philosophers argue that this is because there is an underlying conceptual confusion.[70] These philosophers, such as Ludwig Wittgenstein and his followers in the tradition of linguistic criticism, therefore reject the problem as illusory.[71] They argue that it is an error to ask how mental and biological states fit together. Rather it should simply be accepted that human experience can be described in different ways—for instance, in a mental and in a biological vocabulary. Illusory problems arise if one tries to describe the one in terms of the other's vocabulary or if the mental vocabulary is used in the wrong contexts.[71] This is the case, for instance, if one searches for mental states of the brain. The brain is simply the wrong context for the use of mental vocabulary—the search for mental states of the brain is therefore a category error or a sort of fallacy of reasoning.[71]
+
+Today, such a position is often adopted by interpreters of Wittgenstein such as Peter Hacker.[70] However, Hilary Putnam, the originator of functionalism, has also adopted the position that the mind–body problem is an illusory problem which should be dissolved according to the manner of Wittgenstein.[72]
+Naturalism and its problems
+
+The thesis of physicalism is that the mind is part of the material (or physical) world. Such a position faces the problem that the mind has certain properties that no other material thing seems to possess. Physicalism must therefore explain how it is possible that these properties can nonetheless emerge from a material thing. The project of providing such an explanation is often referred to as the "naturalization of the mental".[53] Some of the crucial problems that this project attempts to resolve include the existence of qualia and the nature of intentionality.[53]
+Qualia
+Main article: Qualia
+
+Many mental states seem to be experienced subjectively in different ways by different individuals.[32] And it is characteristic of a mental state that it has some experiential quality, e.g. of pain, that it hurts. However, the sensation of pain between two individuals may not be identical, since no one has a perfect way to measure how much something hurts or of describing exactly how it feels to hurt. Philosophers and scientists therefore ask where these experiences come from. The existence of cerebral events, in and of themselves, cannot explain why they are accompanied by these corresponding qualitative experiences. The puzzle of why many cerebral processes occur with an accompanying experiential aspect in consciousness seems impossible to explain.[31]
+
+Yet it also seems to many that science will eventually have to explain such experiences.[53] This follows from an assumption about the possibility of reductive explanations. According to this view, if an attempt can be successfully made to explain a phenomenon reductively (e.g., water), then it can be explained why the phenomenon has all of its properties (e.g., fluidity, transparency).[53] In the case of mental states, this means that there needs to be an explanation of why they have the property of being experienced in a certain way.
+
+The 20th-century German philosopher Martin Heidegger criticized the ontological assumptions underpinning such a reductive model, and claimed that it was impossible to make sense of experience in these terms. This is because, according to Heidegger, the nature of our subjective experience and its qualities is impossible to understand in terms of Cartesian "substances" that bear "properties". Another way to put this is that the very concept of qualitative experience is incoherent in terms of—or is semantically incommensurable with the concept of—substances that bear properties.[73]
+
+This problem of explaining introspective first-person aspects of mental states and consciousness in general in terms of third-person quantitative neuroscience is called the explanatory gap.[74] There are several different views of the nature of this gap among contemporary philosophers of mind. David Chalmers and the early Frank Jackson interpret the gap as ontological in nature; that is, they maintain that qualia can never be explained by science because physicalism is false. There are two separate categories involved and one cannot be reduced to the other.[75] An alternative view is taken by philosophers such as Thomas Nagel and Colin McGinn. According to them, the gap is epistemological in nature. For Nagel, science is not yet able to explain subjective experience because it has not yet arrived at the level or kind of knowledge that is required. We are not even able to formulate the problem coherently.[32] For McGinn, on other hand, the problem is one of permanent and inherent biological limitations. We are not able to resolve the explanatory gap because the realm of subjective experiences is cognitively closed to us in the same manner that quantum physics is cognitively closed to elephants.[76] Other philosophers liquidate the gap as purely a semantic problem. This semantic problem, of course, led to the famous "Qualia Question", which is: Does Red cause Redness?
+Intentionality
+Main article: Intentionality
+John Searle—one of the most influential philosophers of mind, proponent of biological naturalism (Berkeley 2002)
+
+Intentionality is the capacity of mental states to be directed towards (about) or be in relation with something in the external world.[26] This property of mental states entails that they have contents and semantic referents and can therefore be assigned truth values. When one tries to reduce these states to natural processes there arises a problem: natural processes are not true or false, they simply happen.[77] It would not make any sense to say that a natural process is true or false. But mental ideas or judgments are true or false, so how then can mental states (ideas or judgments) be natural processes? The possibility of assigning semantic value to ideas must mean that such ideas are about facts. Thus, for example, the idea that Herodotus was a historian refers to Herodotus and to the fact that he was a historian. If the fact is true, then the idea is true; otherwise, it is false. But where does this relation come from? In the brain, there are only electrochemical processes and these seem not to have anything to do with Herodotus.[25]
+Philosophy of perception
+Main article: Philosophy of perception
+
+Philosophy of perception is concerned with the nature of perceptual experience and the status of perceptual objects, in particular how perceptual experience relates to appearances and beliefs about the world. The main contemporary views within philosophy of perception include naive realism, enactivism and representational views.[2][3][78]
+A phrenological mapping of the brain – phrenology was among the first attempts to correlate mental functions with specific parts of the brain although it is now widely discredited.
+Philosophy of mind and science
+
+Humans are corporeal beings and, as such, they are subject to examination and description by the natural sciences. Since mental processes are intimately related to bodily processes (e.g., embodied cognition theory of mind), the descriptions that the natural sciences furnish of human beings play an important role in the philosophy of mind.[1] There are many scientific disciplines that study processes related to the mental. The list of such sciences includes: biology, computer science, cognitive science, cybernetics, linguistics, medicine, pharmacology, and psychology.[79]
+Neurobiology
+Main article: Neuroscience
+
+The theoretical background of biology, as is the case with modern natural sciences in general, is fundamentally materialistic. The objects of study are, in the first place, physical processes, which are considered to be the foundations of mental activity and behavior.[80] The increasing success of biology in the explanation of mental phenomena can be seen by the absence of any empirical refutation of its fundamental presupposition: "there can be no change in the mental states of a person without a change in brain states."[79]
+
+Within the field of neurobiology, there are many subdisciplines that are concerned with the relations between mental and physical states and processes:[80] Sensory neurophysiology investigates the relation between the processes of perception and stimulation.[81] Cognitive neuroscience studies the correlations between mental processes and neural processes.[81] Neuropsychology describes the dependence of mental faculties on specific anatomical regions of the brain.[81] Lastly, evolutionary biology studies the origins and development of the human nervous system and, in as much as this is the basis of the mind, also describes the ontogenetic and phylogenetic development of mental phenomena beginning from their most primitive stages.[79] Evolutionary biology furthermore places tight constraints on any philosophical theory of the mind, as the gene-based mechanism of natural selection does not allow any giant leaps in the development of neural complexity or neural software but only incremental steps over long time periods.[82]
+Since the 1980s, sophisticated neuroimaging procedures, such as fMRI (above), have furnished increasing knowledge about the workings of the human brain, shedding light on ancient philosophical problems.
+
+The methodological breakthroughs of the neurosciences, in particular the introduction of high-tech neuroimaging procedures, has propelled scientists toward the elaboration of increasingly ambitious research programs: one of the main goals is to describe and comprehend the neural processes which correspond to mental functions (see: neural correlate).[80] Several groups are inspired by these advances.
+Computer science
+Main article: Computer science
+
+Computer science concerns itself with the automatic processing of information (or at least with physical systems of symbols to which information is assigned) by means of such things as computers.[83] From the beginning, computer programmers have been able to develop programs that permit computers to carry out tasks for which organic beings need a mind. A simple example is multiplication. It is not clear whether computers could be said to have a mind. Could they, someday, come to have what we call a mind? This question has been propelled into the forefront of much philosophical debate because of investigations in the field of artificial intelligence (AI).
+
+Within AI, it is common to distinguish between a modest research program and a more ambitious one: this distinction was coined by John Searle in terms of a weak AI and strong AI. The exclusive objective of "weak AI", according to Searle, is the successful simulation of mental states, with no attempt to make computers become conscious or aware, etc. The objective of strong AI, on the contrary, is a computer with consciousness similar to that of human beings.[84] The program of strong AI goes back to one of the pioneers of computation Alan Turing. As an answer to the question "Can computers think?", he formulated the famous Turing test.[85] Turing believed that a computer could be said to "think" when, if placed in a room by itself next to another room that contained a human being and with the same questions being asked of both the computer and the human being by a third party human being, the computer's responses turned out to be indistinguishable from those of the human. Essentially, Turing's view of machine intelligence followed the behaviourist model of the mind—intelligence is as intelligence does. The Turing test has received many criticisms, among which the most famous is probably the Chinese room thought experiment formulated by Searle.[84]
+
+The question about the possible sensitivity (qualia) of computers or robots still remains open. Some computer scientists believe that the specialty of AI can still make new contributions to the resolution of the "mind–body problem". They suggest that based on the reciprocal influences between software and hardware that takes place in all computers, it is possible that someday theories can be discovered that help us to understand the reciprocal influences between the human mind and the brain (wetware).[86]
+Psychology
+Main article: Psychology
+
+Psychology is the science that investigates mental states directly. It uses generally empirical methods to investigate concrete mental states like joy, fear or obsessions. Psychology investigates the laws that bind these mental states to each other or with inputs and outputs to the human organism.[87]
+
+An example of this is the psychology of perception. Scientists working in this field have discovered general principles of the perception of forms. A law of the psychology of forms says that objects that move in the same direction are perceived as related to each other.[79] This law describes a relation between visual input and mental perceptual states. However, it does not suggest anything about the nature of perceptual states. The laws discovered by psychology are compatible with all the answers to the mind–body problem already described.
+Cognitive science
+
+Cognitive science is the interdisciplinary scientific study of the mind and its processes. It examines what cognition is, what it does, and how it works. It includes research on intelligence and behavior, especially focusing on how information is represented, processed, and transformed (in faculties such as perception, language, memory, reasoning, and emotion) within nervous systems (human or other animals) and machines (e.g. computers). Cognitive science consists of multiple research disciplines, including psychology, artificial intelligence, philosophy, neuroscience, linguistics, anthropology, sociology, and education.[88] It spans many levels of analysis, from low-level learning and decision mechanisms to high-level logic and planning; from neural circuitry to modular brain organization. Over the years, cognitive science has evolved from a representational and information processing approach to explaining the mind to embrace an embodied perspective of it. Accordingly, bodily processes play a significant role in the acquisition, development, and shaping of cognitive capabilities.[89] For instance, Rowlands (2012) argues that cognition is enactive, embodied, embedded, affective and (potentially) extended. The position is taken that the "classical sandwich" of cognition sandwiched between perception and action is artificial; cognition has to be seen as a product of a strongly coupled interaction that cannot be divided this way.[90][91]
+Near-death research
+Main article: Near-death studies
+
+In the field of near-death research, the following phenomenon, among others, occurs: For example, during some brain operations the brain is artificially and measurably deactivated. Nevertheless, some patients report during this phase that they have perceived what is happening in their surroundings, i.e. that they have had consciousness. Patients also report experiences during a cardiac arrest. There is the following problem: As soon as the brain is no longer supplied with blood and thus with oxygen after a cardiac arrest, the brain ceases its normal operation after about 15 seconds, i.e. the brain falls into a state of unconsciousness.[92]
+Philosophy of mind in the continental tradition
+
+Most of the discussion in this article has focused on one style or tradition of philosophy in modern Western culture, usually called analytic philosophy (sometimes described as Anglo-American philosophy).[93] Many other schools of thought exist, however, which are sometimes subsumed under the broad (and vague) label of continental philosophy.[93] In any case, though topics and methods here are numerous, in relation to the philosophy of mind the various schools that fall under this label (phenomenology, existentialism, etc.) can globally be seen to differ from the analytic school in that they focus less on language and logical analysis alone but also take in other forms of understanding human existence and experience. With reference specifically to the discussion of the mind, this tends to translate into attempts to grasp the concepts of thought and perceptual experience in some sense that does not merely involve the analysis of linguistic forms.[93]
+
+Immanuel Kant's Critique of Pure Reason, first published in 1781 and presented again with major revisions in 1787, represents a significant intervention into what will later become known as the philosophy of mind. Kant's first critique is generally recognized as among the most significant works of modern philosophy in the West. Kant is a figure whose influence is marked in both continental and analytic/Anglo-American philosophy. Kant's work develops an in-depth study of transcendental consciousness, or the life of the mind as conceived through the universal categories of understanding.
+
+In Georg Wilhelm Friedrich Hegel's Philosophy of Mind (frequently translated as Philosophy of Spirit or Geist),[94] the third part of his Encyclopedia of the Philosophical Sciences, Hegel discusses three distinct types of mind: the "subjective mind/spirit", the mind of an individual; the "objective mind/spirit", the mind of society and of the State; and the "Absolute mind/spirit", the position of religion, art, and philosophy. See also Hegel's The Phenomenology of Spirit. Nonetheless, Hegel's work differs radically from the style of Anglo-American philosophy of mind.
+
+In 1896, Henri Bergson made in Matter and Memory "Essay on the relation of body and spirit" a forceful case for the ontological difference of body and mind by reducing the problem to the more definite one of memory, thus allowing for a solution built on the empirical test case of aphasia.
+
+In modern times, the two main schools that have developed in response or opposition to this Hegelian tradition are phenomenology and existentialism. Phenomenology, founded by Edmund Husserl, focuses on the contents of the human mind (see noema) and how processes shape our experiences.[95] Existentialism, a school of thought founded upon the work of Søren Kierkegaard, focuses on Human predicament and how people deal with the situation of being alive. Existential-phenomenology represents a major branch of continental philosophy (they are not contradictory), rooted in the work of Husserl but expressed in its fullest forms in the work of Martin Heidegger, Jean-Paul Sartre, Simone de Beauvoir and Maurice Merleau-Ponty. See Heidegger's Being and Time, Merleau-Ponty's Phenomenology of Perception, Sartre's Being and Nothingness, and Simone de Beauvoir's The Second Sex.
+Topics related to philosophy of mind
+
+There are countless subjects that are affected by the ideas developed in the philosophy of mind. Clear examples of this are the nature of death and its definitive character, the nature of emotion, of perception and of memory. Questions about what a person is and what his or her identity have to do with the philosophy of mind. There are two subjects that, in connection with the philosophy of the mind, have aroused special attention: free will and the self.[1]
+Free will
+Main article: Free will
+
+In the context of philosophy of mind, the problem of free will takes on renewed intensity. This is the case for materialistic determinists.[1] According to this position, natural laws completely determine the course of the material world. Mental states, and therefore the will as well, would be material states, which means human behavior and decisions would be completely determined by natural laws. Some take this reasoning a step further: people cannot determine by themselves what they want and what they do. Consequently, they are not free.[96]
+
+This argumentation is rejected, on the one hand, by the compatibilists. Those who adopt this position suggest that the question "Are we free?" can only be answered once we have determined what the term "free" means. The opposite of "free" is not "caused" but "compelled" or "coerced". It is not appropriate to identify freedom with indetermination. A free act is one where the agent could have done otherwise if it had chosen otherwise. In this sense a person can be free even though determinism is true.[96] The most important compatibilist in the history of the philosophy was David Hume.[97] More recently, this position is defended, for example, by Daniel Dennett.[98]
+
+On the other hand, there are also many incompatibilists who reject the argument because they believe that the will is free in a stronger sense called libertarianism.[96] These philosophers affirm the course of the world is either a) not completely determined by natural law where natural law is intercepted by physically independent agency,[99] b) determined by indeterministic natural law only, or c) determined by indeterministic natural law in line with the subjective effort of physically non-reducible agency.[100] Under Libertarianism, the will does not have to be deterministic and, therefore, it is potentially free. Critics of the second proposition (b) accuse the incompatibilists of using an incoherent concept of freedom. They argue as follows: if our will is not determined by anything, then we desire what we desire by pure chance. And if what we desire is purely accidental, we are not free. So if our will is not determined by anything, we are not free.[96]
+Self
+Main article: Philosophy of self
+
+The philosophy of mind also has important consequences for the concept of "self". If by "self" or "I" one refers to an essential, immutable nucleus of the person, some modern philosophers of mind, such as Daniel Dennett believe that no such thing exists. According to Dennett and other contemporaries, the self is considered an illusion.[101] The idea of a self as an immutable essential nucleus derives from the idea of an immaterial soul. Such an idea is unacceptable to modern philosophers with physicalist orientations and their general skepticism of the concept of "self" as postulated by David Hume, who could never catch himself not doing, thinking or feeling anything.[102] However, in the light of empirical results from developmental psychology, developmental biology and neuroscience, the idea of an essential inconstant, material nucleus—an integrated representational system distributed over changing patterns of synaptic connections—seems reasonable.[103]
+
+
+
+How is a sovereign state defined?
+
+A sovereign state is an entity with a permanent population, a defined territory, an effective government, and the capacity to conduct international relations. These criteria are often loosely applied. For example, boundary disputes and ongoing civil wars do not necessarily prevent an entity from becoming a state if it is formally independent from other states.
+Does a state need to get recognition from other states?
+
+No, a state technically does not need to get recognition from other states. Under the prevailing declaratory theory of recognition, a state exists if it meets the necessary criteria that define states. Recognition is simply an acknowledgment of an existing situation. (The minority constitutive theory of recognition holds that recognition is necessary to the existence of a state.)
+When can a state use military force against another state?
+
+A state can use military force against another state only in self-defense against an armed attack. This right arises from Article 51 of the United Nations Charter, which incorporates inherent rights from customary international law. Any acts of self-defense must be necessary and proportionate to the acts of aggression. Acts of anticipatory self-defense may be permitted when an armed attack is imminent and inevitable, although the UN Charter does not address this situation.
+What does international humanitarian law do?
+
+International humanitarian law restricts the ways in which wars can be conducted. It protects the safety of non-combatants, as well as former combatants like prisoners of war. It also bans the use of certain weapons or tactics that inflict unnecessary harm or suffering, cause severe or lasting harm to the environment, or cannot be used in a way that allows those using them to distinguish between combatant and non-combatant targets.
+What are some of the human rights guaranteed by international law?
+
+Human rights guaranteed by international law include civil, political, economic, social, and cultural rights. Examples include freedom of expression, freedom of religion, freedom of association, the right to an adequate standard of living, the right to work in favorable conditions, the right to education, and protections against arbitrary arrest and detention. These rights are codified in the Universal Declaration of Human Rights and other United Nations instruments, known collectively as the International Bill of Human Rights.
+What is the concept of sustainable development?
+
+Sustainable development is defined as meeting the present needs of a generation without preventing future generations from meeting their needs. It has been a guiding principle of international environmental law since the Earth Summit in Rio de Janeiro in 1992, and it even has influenced economic treaties. However, sustainable development has not yet been achieved, despite some legal and political progress.
+What are the main organs of the United Nations?
+
+The main organs of the United Nations are the General Assembly, the Security Council, the Secretariat, the International Court of Justice, the Economic and Social Council, and the Trusteeship Council. The General Assembly is a representative policy-making organ in which member states vote on resolutions and other actions. The Security Council protects international peace and security, approves changes to the UN Charter, and recommends new UN member states. Led by the UN Secretary-General, the Secretariat carries out the mandates of the General Assembly and other UN organs. The International Court of Justice resolves disputes between states and issues advisory opinions to non-state organizations. The Economic and Social Council develops policy recommendations based on meetings and consultations. The Trusteeship Council has been inactive since the 1990s, when the last UN Trust Territory gained independence.
+Which cases are heard by the International Court of Justice?
+
+The International Court of Justice has contentious jurisdiction and advisory jurisdiction. Its contentious jurisdiction involves resolving disputes between states under international law. Each state involved in a dispute must consent to ICJ jurisdiction. While contentious jurisdiction leads to binding decisions, advisory jurisdiction involves issuing non-binding opinions to public international organizations. These opinions generally carry great weight and can resolve ambiguities in international law.
+How are treaties different from executive agreements under US law?
+
+A treaty requires the advice and consent of two-thirds of the Senate, and it must be ratified by the President. An executive agreement can be negotiated by the President without the advice and consent of two-thirds of the Senate. In a congressional-executive agreement, the President gets the approval of a simple majority of both houses of Congress. In a sole executive agreement, the President acts without involving Congress. However, treaties and executive agreements are equally binding under international law.
+When does a treaty supersede federal laws?
+
+A treaty supersedes prior inconsistent federal laws if Congress implements it through new federal laws or if it is self-executing. A treaty is self-executing if there is an intent to make it enforceable under US law without additional implementing legislation. Some provisions in a treaty may be self-executing even if other provisions are not. Specific provisions are more likely to be considered self-executing. A provision may be self-executing in the US even if it is not self-executing in other signatory nations.
+
+
+Recognition of States
+
+The process in which a state acknowledges another entity as a state is known as recognition. This can involve an overt statement or an action that implies an intent to recognize the entity as a state. Each state can make its own decision about whether recognition is appropriate, which can carry significant political weight. For example, recognition is usually required to establish sovereign and diplomatic immunities.
+
+International law contains two theories of recognition. The constitutive theory of recognition holds that a state does not exist until it receives recognition. By contrast, the declaratory theory of recognition holds that a state exists without recognition, which is merely an acknowledgment of an existing situation. The declaratory theory has become the prevailing view. That said, an entity likely has a stronger claim to statehood when it has received recognition from many other states. This is especially true if questions surround its ability to meet the criteria under the Montevideo Convention.
+
+Non-Recognition and Qualified Recognition
+
+Statehood does not rely on recognition, but sometimes a state may have a duty to refrain from recognizing another state or an alteration to a state. This situation usually arises when the state or altered state arose from illegitimate military actions, violations of human rights, or other clear infringements of international norms. The United Nations Security Council often sets an example for states on this issue. For example, it nullified the annexation of Kuwait by Iraq during the period preceding the Gulf War of 1991.
+
+In other cases, a state may not recognize an entity that meets the baseline criteria for statehood until it meets specific additional requirements. For example, states formed during the dissolution of the Soviet Union did not receive recognition from the European Community (the precursor to the European Union) until they committed to nuclear non-proliferation, minority rights, and respect for borders.
+
+The Solar System[c] is the gravitationally bound system of the Sun and the objects that orbit it. The largest of these objects are the eight planets, which in order from the Sun are four terrestrial planets (Mercury, Venus, Earth and Mars); two gas giants (Jupiter and Saturn); and two ice giants (Uranus and Neptune). The Solar System formed 4.6 billion years ago from the gravitational collapse of a giant interstellar molecular cloud.
+
+All four terrestrial planets belong to the inner Solar System (≤ 1.7 AU) and have a solid surface. Inversely, all four giant planets belong to the outer Solar System (≤ 30.5 AU) and do not have a definite surface, as they are mainly composed of gases and liquids. 99.86% of the Solar System's mass is in the Sun and nearly 90% of the remaining mass are in Jupiter and Saturn. There is a strong consensus among astronomers that the Solar System also has nine dwarf planets, which consist of one asteroid-belt object – Ceres; five Kuiper-belt objects – Pluto, Orcus, Haumea, Quaoar, and Makemake; and three scattered-disc objects – Gonggong, Eris, and Sedna.
+
+There are a vast number of smaller objects orbiting the Sun, called small Solar System bodies. This category includes asteroids, comets, centaurs, meteoroids and interplanetary dust clouds. Many of these objects are in the asteroid belt between the orbits of Mars and Jupiter (1.5–4.5 astronomical units, AU), and the Kuiper belt just outside Neptune's orbit (30–50 AU).[d] Six of the major planets, the six largest possible dwarf planets, and many of the smaller bodies are orbited by natural satellites, commonly called "moons" after Earth's Moon. Two natural satellites, Jupiter's moon Ganymede and Saturn's moon Titan, are larger than Mercury, the smallest terrestrial planet, though they are less massive.
+
+The Sun's stream of charged particles creates the heliosphere, which terminates where the pressure of the solar wind is equal to the surrounding interstellar medium, forming a boundary called the heliopause. The outermost region of the Solar System is the Oort cloud (from 2,000 to 50,000–200,000 AU), the source for long-period comets. The Solar System, which ends at the Sun's sphere of gravitational influence (50,000–200,000 AU), is embedded in the Local Cloud of the interstellar medium and orbits the Galactic Center. The closest star to the Solar System, Proxima Centauri, is 4.25 light years away.
+Formation and evolution
+Main article: Formation and evolution of the Solar System
+
+The Solar System formed 4.568 billion years ago from the gravitational collapse of a region within a large molecular cloud.[e] This initial cloud was likely several light-years across and probably birthed several stars.[5] As is typical of molecular clouds, this one consisted mostly of hydrogen, with some helium, and small amounts of heavier elements fused by previous generations of stars.[6]
+
+As the pre-solar nebula[6] collapsed, conservation of angular momentum caused it to rotate faster. The center, where most of the mass collected, became increasingly hotter than the surrounding disc.[5] As the contracting nebula rotated faster, it began to flatten into a protoplanetary disc with a diameter of roughly 200 AU (30 billion km; 19 billion mi)[5] and a hot, dense protostar at the center.[7][8] The planets formed by accretion from this disc,[9] in which dust and gas gravitationally attracted each other, coalescing to form ever larger bodies. Hundreds of protoplanets may have existed in the early Solar System, but they either merged or were destroyed or ejected, leaving the planets, dwarf planets, and leftover minor bodies.[10][11]
+Diagram of the early Solar System's protoplanetary disk, out of which Earth and other Solar System bodies formed
+
+Due to their higher boiling points, only metals and silicates could exist in solid form in the warm inner Solar System close to the Sun (within the frost line). They would eventually form the rocky planets of Mercury, Venus, Earth, and Mars. Because metallic elements only comprised a very small fraction of the solar nebula, the terrestrial planets could not grow very large.[10]
+
+The giant planets (Jupiter, Saturn, Uranus, and Neptune) formed further out, beyond the frost line, the point between the orbits of Mars and Jupiter where material is cool enough for volatile icy compounds to remain solid. The ices that formed these planets were more plentiful than the metals and silicates that formed the terrestrial inner planets, allowing them to grow massive enough to capture large atmospheres of hydrogen and helium, the lightest and most abundant elements.[10]
+
+Leftover debris that never became planets congregated in regions such as the asteroid belt, Kuiper belt, and Oort cloud.[10] The Nice model is an explanation for the creation of these regions and how the outer planets could have formed in different positions and migrated to their current orbits through various gravitational interactions.[12][further explanation needed]
+
+Within 50 million years, the pressure and density of hydrogen in the center of the protostar became great enough for it to begin thermonuclear fusion.[13] As helium accumulates at its core the Sun is growing brighter;[14] early in its main-sequence life its brightness was 70% that of what it is today.[15] The temperature, reaction rate, pressure, and density increased until hydrostatic equilibrium was achieved: the thermal pressure counterbalancing the force of gravity. At this point, the Sun became a main-sequence star.[16]
+
+The main-sequence phase, from beginning to end, will last about 10 billion years for the Sun compared to around two billion years for all other subsequent phases of the Sun's pre-remnant life combined.[17] Solar wind from the Sun created the heliosphere and swept away the remaining gas and dust from the protoplanetary disc into interstellar space.[14]
+
+The Solar System will remain roughly as it is known today until the hydrogen in the core of the Sun has been entirely converted to helium, which will occur roughly 5 billion years from now. This will mark the end of the Sun's main-sequence life. At that time, the core of the Sun will contract with hydrogen fusion occurring along a shell surrounding the inert helium, and the energy output will be greater than at present. The outer layers of the Sun will expand to roughly 260 times its current diameter, and the Sun will become a red giant. Because of its increased surface area, the surface of the Sun will be cooler (2,600 K (2,330 °C; 4,220 °F) at its coolest) than it is on the main sequence.[17]
+Overview of the evolution of the Sun, a G-type main-sequence star. Around 11 billion years after being formed by the Solar System's protoplanetary disk, the Sun will expand to become a red giant; Mercury, Venus and possibly the Earth will be swallowed.
+
+The expanding Sun is expected to vaporize Mercury as well as Venus, and render Earth uninhabitable (possibly destroying it as well). Eventually, the core will be hot enough for helium fusion; the Sun will burn helium for a fraction of the time it burned hydrogen in the core. The Sun is not massive enough to commence the fusion of heavier elements, and nuclear reactions in the core will dwindle. Its outer layers will be ejected into space, leaving behind a dense white dwarf, half the original mass of the Sun but only the size of Earth.[18] The ejected outer layers will form what is known as a planetary nebula, returning some of the material that formed the Sun—but now enriched with heavier elements like carbon—to the interstellar medium.[19]
+Structure and composition
+Further information: List of Solar System objects and Planet § Planetary attributes
+
+The word solar means "pertaining to the Sun", which is derived from the Latin word sol, meaning Sun.[20] The Sun is the dominant gravitational member of the Solar System, and its planetary system is maintained in a relatively stable, slowly evolving state by following isolated, gravitationally bound orbits around the Sun.[21]
+Orbits
+Animations of the Solar System's inner planets and outer planets orbiting; the latter animation is 100 times faster than the former. Jupiter is three times as far from the Sun as Mars.
+
+The planets and other large objects in orbit around the Sun lie near the plane of Earth's orbit, known as the ecliptic. Smaller icy objects such as comets frequently orbit at significantly greater angles to this plane.[22][23] Most of the planets in the Solar System have secondary systems of their own, being orbited by natural satellites called moons. Many of the largest natural satellites are in synchronous rotation, with one face permanently turned toward their parent. The four giant planets have planetary rings, thin bands of tiny particles that orbit them in unison.[24]
+
+As a result of the formation of the Solar System, planets and most other objects orbit the Sun in the same direction that the Sun is rotating. That is, counter-clockwise, as viewed from above Earth's north pole.[25] There are exceptions, such as Halley's Comet.[26] Most of the larger moons orbit their planets in prograde direction, matching the planetary rotation; Neptune's moon Triton is the largest to orbit in the opposite, retrograde manner.[27] Most larger objects rotate around their own axes in the prograde direction relative to their orbit, though the rotation of Venus is retrograde.[28]
+
+To a good first approximation, Kepler's laws of planetary motion describe the orbits of objects around the Sun.[29]: 433–437  These laws stipulate that each object travels along an ellipse with the Sun at one focus, which causes the body's distance from the Sun to vary over the course of its year. A body's closest approach to the Sun is called its perihelion, whereas its most distant point from the Sun is called its aphelion.[30]: 9-6  With the exception of Mercury, the orbits of the planets are nearly circular, but many comets, asteroids, and Kuiper belt objects follow highly elliptical orbits. Kepler's laws only account for the influence of the Sun's gravity upon an orbiting body, not the gravitational pulls of different bodies upon each other. On a human time scale, these additional perturbations can be accounted for using numerical models,[30]: 9-6  but the planetary system can change chaotically over billions of years.[31]
+
+The angular momentum of the Solar System is a measure of the total amount of orbital and rotational momentum possessed by all its moving components.[32] Although the Sun dominates the system by mass, it accounts for only about 2% of the angular momentum.[33][34] The planets, dominated by Jupiter, account for most of the rest of the angular momentum due to the combination of their mass, orbit, and distance from the Sun, with a possibly significant contribution from comets.[33]
+Composition
+
+The overall structure of the charted regions of the Solar System consists of the Sun, four smaller inner planets surrounded by a belt of mostly rocky asteroids, and four giant planets surrounded by the Kuiper belt of mostly icy objects. Astronomers sometimes informally divide this structure into separate regions. The inner Solar System includes the four terrestrial planets and the asteroid belt. The outer Solar System is beyond the asteroids, including the four giant planets.[35] Since the discovery of the Kuiper belt, the outermost parts of the Solar System are considered a distinct region consisting of the objects beyond Neptune.[36]
+
+The principal component of the Solar System is the Sun, a low-mass star that contains 99.86% of the system's known mass and dominates it gravitationally.[37] The Sun's four largest orbiting bodies, the giant planets, account for 99% of the remaining mass, with Jupiter and Saturn together comprising more than 90%. The remaining objects of the Solar System (including the four terrestrial planets, the dwarf planets, moons, asteroids, and comets) together comprise less than 0.002% of the Solar System's total mass.[f]
+
+The Sun is composed of roughly 98% hydrogen and helium,[41] as are Jupiter and Saturn.[42][43] A composition gradient exists in the Solar System, created by heat and light pressure from the early Sun; those objects closer to the Sun, which are more affected by heat and light pressure, are composed of elements with high melting points. Objects farther from the Sun are composed largely of materials with lower melting points.[44] The boundary in the Solar System beyond which those volatile substances could coalesce is known as the frost line, and it lies at roughly five times the Earth's distance from the Sun.[3]
+
+The objects of the inner Solar System are composed mostly of rocky materials,[45] such as silicates, iron or nickel.[46] Jupiter and Saturn are composed mainly of gases with extremely low melting points and high vapor pressure, such as hydrogen, helium, and neon.[46] Ices, like water, methane, ammonia, hydrogen sulfide, and carbon dioxide,[45] have a melting points of up to a few hundred kelvins.[46] They can be found as ices, liquids, or gases in various places in the Solar System.[46] Icy substances comprise the majority of the satellites of the giant planets, as well as most of Uranus and Neptune (the so-called "ice giants") and the numerous small objects that lie beyond Neptune's orbit.[45][47] Together, gases and ices are referred to as volatiles.[48]
+Distances and scales
+The Sun's, planets', dwarf planets' and moons' size to scale, labelled. Distance of objects is not to scale. The asteroid belt lies between the orbits of Mars and Jupiter, the Kuiper belt lies beyond Neptune's orbit.
+To-scale diagram of distance between planets, with the white bar showing orbital variations. The size of the planets is not to scale.
+
+The astronomical unit [AU] (150,000,000 km; 93,000,000 mi) would be the distance from the Earth to the Sun if the planet's orbit were perfectly circular.[49] For comparison, the radius of the Sun is 0.0047 AU (700,000 km; 400,000 mi).[50] Thus, the Sun occupies 0.00001% (10−5 %) of the volume of a sphere with a radius the size of Earth's orbit, whereas Earth's volume is roughly one millionth (10−6) that of the Sun. Jupiter, the largest planet, is 5.2 astronomical units (780,000,000 km; 480,000,000 mi) from the Sun and has a radius of 71,000 km (0.00047 AU; 44,000 mi), whereas the most distant planet, Neptune, is 30 AU (4.5×109 km; 2.8×109 mi) from the Sun.[43][51]
+
+With a few exceptions, the farther a planet or belt is from the Sun, the larger the distance between its orbit and the orbit of the next nearest object to the Sun. For example, Venus is approximately 0.33 AU farther out from the Sun than Mercury, whereas Saturn is 4.3 AU out from Jupiter, and Neptune lies 10.5 AU out from Uranus. Attempts have been made to determine a relationship between these orbital distances, like the Titius–Bode law[52] and Johannes Kepler's model based on the Platonic solids,[53] but ongoing discoveries have invalidated these hypotheses.[54]
+
+Some Solar System models attempt to convey the relative scales involved in the Solar System in human terms. Some are small in scale (and may be mechanical—called orreries)—whereas others extend across cities or regional areas.[55] The largest such scale model, the Sweden Solar System, uses the 110-metre (361 ft) Avicii Arena in Stockholm as its substitute Sun, and, following the scale, Jupiter is a 7.5-metre (25-foot) sphere at Stockholm Arlanda Airport, 40 km (25 mi) away, whereas the farthest current object, Sedna, is a 10 cm (4 in) sphere in Luleå, 912 km (567 mi) away.[56][57]
+
+If the Sun–Neptune distance is scaled to 100 metres (330 ft), then the Sun would be about 3 cm (1.2 in) in diameter (roughly two-thirds the diameter of a golf ball), the giant planets would be all smaller than about 3 mm (0.12 in), and Earth's diameter along with that of the other terrestrial planets would be smaller than a flea (0.3 mm or 0.012 in) at this scale.[58]
+Interplanetary environment
+The zodiacal light, caused by interplanetary dust
+
+The outermost layer of the Solar atmosphere is the heliosphere, which permeates much of the Solar planetary system. Along with light, the Sun radiates a continuous stream of charged particles (a plasma) called the solar wind. This stream of particles spreads outwards at speeds from 900,000 kilometres per hour (560,000 mph) to 2,880,000 kilometres per hour (1,790,000 mph),[59] filling the vacuum between the bodies of the Solar System. The result is a thin, dusty atmosphere, called the interplanetary medium, which extends to at least 100 AU (15 billion km; 9.3 billion mi). Beyond the heliosphere, large objects remain gravitationally bound to the sun, but the flow of matter in the interstellar medium homogenizes the distribution of micro-scale objects (see § Farthest regions).[60]
+
+The interplanetary medium is home to at least two disc-like regions of cosmic dust. The first, the zodiacal dust cloud, lies in the inner Solar System and causes the zodiacal light. It may have been formed by collisions within the asteroid belt brought on by gravitational interactions with the planets; a more recent proposed origin is the planet Mars.[61] The second dust cloud extends from about 10 AU (1.5 billion km; 930 million mi) to about 40 AU (6.0 billion km; 3.7 billion mi), and was probably created by collisions within the Kuiper belt.[62][63]
+
+Activity on the Sun's surface, such as solar flares and coronal mass ejections, disturbs the heliosphere, creating space weather and causing geomagnetic storms.[64] Coronal mass ejections and similar events blow a magnetic field and huge quantities of material from the surface of the Sun. The interaction of this magnetic field and material with Earth's magnetic field funnels charged particles into Earth's upper atmosphere, where its interactions create aurorae seen near the magnetic poles.[65] The largest stable structure within the heliosphere is the heliospheric current sheet, a spiral form created by the actions of the Sun's rotating magnetic field on the interplanetary medium.[66][67]
+Life habitability
+Main article: Planetary habitability in the Solar System
+
+Besides solar energy, the primary characteristic of the Solar System enabling the presence of life is the heliosphere and planetary magnetic fields (for those planets that have them). These magnetic fields partially shield the Solar System from high-energy interstellar particles called cosmic rays. The density of cosmic rays in the interstellar medium and the strength of the Sun's magnetic field change on very long timescales, so the level of cosmic-ray penetration in the Solar System varies, though by how much is unknown.[68]
+
+Earth's magnetic field also stops its atmosphere from being stripped away by the solar wind.[69] Venus and Mars do not have magnetic fields, and as a result the solar wind causes their atmospheres to gradually bleed away into space.[70]
+
+The zone of habitability of the Solar System is conventionally located in the inner Solar System, where planetary surface or atmospheric temperatures admit the possibility of liquid water.[71] Habitability might also be possible in subsurface oceans of various outer Solar System moons.[72]
+Sun
+Main article: Sun
+The Sun in true white color
+
+The Sun is the Solar System's star and by far its most massive component. Its large mass (332,900 Earth masses),[73] which comprises 99.86% of all the mass in the Solar System,[74] produces temperatures and densities in its core high enough to sustain nuclear fusion of hydrogen into helium.[75] This releases an enormous amount of energy, mostly radiated into space as electromagnetic radiation peaking in visible light.[76][77]
+
+Because the Sun fuses hydrogen into helium at its core, it is a main-sequence star. More specifically, it is a G2-type main-sequence star, where the type designation refers to its effective temperature. Hotter main-sequence stars are more luminous but shorter lived. The Sun's temperature is intermediate between that of the hottest stars and that of the coolest stars. Stars brighter and hotter than the Sun are rare, whereas substantially dimmer and cooler stars, known as red dwarfs, make up about 75% of the stars in the Milky Way.[78][79]
+
+The Sun is a population I star; it has a higher abundance of elements heavier than hydrogen and helium ("metals" in astronomical parlance) than the older population II stars.[80] Elements heavier than hydrogen and helium were formed in the cores of ancient and exploding stars, so the first generation of stars had to die before the universe could be enriched with these atoms. The oldest stars contain few metals, whereas stars born later have more. This higher metallicity is thought to have been crucial to the Sun's development of a planetary system because the planets form from the accretion of "metals".[81]
+Inner Solar System
+Overview of the Inner Solar System up to the Jovian System
+
+The inner Solar System is the region comprising the terrestrial planets and the asteroid belt.[82] Composed mainly of silicates and metals,[83] the objects of the inner Solar System are relatively close to the Sun; the radius of this entire region is less than the distance between the orbits of Jupiter and Saturn. This region is also within the frost line, which is a little less than 5 AU (750 million km; 460 million mi) from the Sun.[22]
+Inner planets
+Main article: Terrestrial planet
+The four terrestrial planets Mercury, Venus, Earth and Mars
+
+The four terrestrial or inner planets have dense, rocky compositions, few or no moons, and no ring systems. They are in hydrostatic equilibrium, forming a rounded shape, and have undergone planetary differentiation, causing chemical elements to accumulate at different radii. They are composed largely of refractory minerals such as silicates—which form their crusts and mantles—and metals such as iron and nickel which form their cores. Three of the four inner planets (Venus, Earth and Mars) have atmospheres substantial enough to generate weather; all have impact craters and tectonic surface features, such as rift valleys and volcanoes. The term inner planet should not be confused with inferior planet, which designates those planets that are closer to the Sun than Earth (i.e. Mercury and Venus).[84]
+Mercury
+Main article: Mercury (planet)
+
+Mercury (0.307–0.588 AU (45.9–88.0 million km; 28.5–54.7 million mi) from the Sun[85]) is the closest planet to the Sun. The smallest planet in the Solar System (0.055 MEarth), Mercury has no natural satellites. The dominant geological features are impact craters or basins with ejecta blankets, the remains of early volcanic activity including magma flows, and lobed ridges or rupes that were probably produced by a period of contraction early in the planet's history.[86] Mercury's very tenuous atmosphere consists of solar-wind particles trapped by Mercury's magnetic field, as well as atoms blasted off its surface by the solar wind.[87][88] Its relatively large iron core and thin mantle have not yet been adequately explained. Hypotheses include that its outer layers were stripped off by a giant impact, or that it was prevented from fully accreting by the young Sun's energy.[89][90] There have been searches for "Vulcanoids", asteroids in stable orbits between Mercury and the Sun, but none have been discovered.[91][92]
+Venus
+Main article: Venus
+
+Venus (0.718–0.728 AU (107.4–108.9 million km; 66.7–67.7 million mi) from the Sun[85]) is close in size to Earth (0.815 MEarth) and, like Earth, has a thick silicate mantle around an iron core, a substantial atmosphere, and evidence of internal geological activity. It is much drier than Earth, and its atmosphere is ninety times as dense. Venus has no natural satellites. It is the hottest planet, with surface temperatures over 400 °C (752 °F), mainly due to the amount of greenhouse gases in the atmosphere.[93] The planet has no magnetic field that would prevent the depletion of its substantial atmosphere, which suggests that its atmosphere is being replenished by volcanic eruptions.[94] A relatively young planetary surface displays extensive evidence of volcanic activity, but is devoid of plate tectonics. It may undergo resurfacing episodes on a time scale of 700 million years.[95]
+Earth
+Main article: Earth
+
+Earth (0.983–1.017 AU (147.1–152.1 million km; 91.4–94.5 million mi) from the Sun) is the largest and densest of the inner planets, the only one known to have current geological activity, and the only place in the universe where life is known to exist.[96] Its liquid hydrosphere is unique among the terrestrial planets, and it is the only planet where plate tectonics has been observed.[97] Earth's atmosphere is radically different from those of the other planets, having been altered by the presence of life to contain 21% free oxygen.[98][99] The planetary magnetosphere shields the surface from solar and cosmic radiation, limiting atmospheric stripping and maintaining habitability.[100] It has one natural satellite, the Moon, the only large satellite of a terrestrial planet in the Solar System.
+Mars
+Main article: Mars
+
+Mars (1.382–1.666 AU (206.7–249.2 million km; 128.5–154.9 million mi) from the Sun) is smaller than Earth and Venus (0.107 MEarth). It has an atmosphere of mostly carbon dioxide with a surface pressure of 6.1 millibars (0.088 psi; 0.18 inHg); roughly 0.6% of that of Earth but sufficient to support weather phenomena.[101] Its surface, peppered with volcanoes, such as Olympus Mons, and rift valleys, such as Valles Marineris, shows geological activity that may have persisted until as recently as 2 million years ago.[102] Its red color comes from iron oxide (rust) in its soil,[103] while the polar regions show white ice caps consisting largely of water.[104] Mars has two tiny natural satellites (Deimos and Phobos) thought to be either captured asteroids,[105] or ejected debris from a massive impact early in Mars's history.[106]
+Asteroid belt
+Main articles: Asteroid belt and Asteroid
+Linear map of the inner Solar System, showing many asteroid populations
+
+Asteroids except for the largest, Ceres, are classified as small Solar System bodies[g] and are composed mainly of carbonaceous, refractory rocky and metallic minerals, with some ice.[112][113] They range from a few metres to hundreds of kilometres in size. Asteroids smaller than one meter are usually called meteoroids and micrometeoroids (grain-sized), with the exact division between the two categories being debated over the years.[114] As of 2017, the IAU designates asteroids having a diameter between about 30 micrometres and 1 metre as micrometeoroids, and terms smaller particles "dust".[115]
+
+The asteroid belt occupies the orbit between Mars and Jupiter, between 2.3 and 3.3 AU (340 and 490 million km; 210 and 310 million mi) from the Sun. It is thought to be remnants from the Solar System's formation that failed to coalesce because of the gravitational interference of Jupiter.[116] The asteroid belt contains tens of thousands, possibly millions, of objects over one kilometre in diameter.[117] Despite this, the total mass of the asteroid belt is unlikely to be more than a thousandth of that of Earth.[40] The asteroid belt is very sparsely populated; spacecraft routinely pass through without incident.[118]
+Ceres
+Main article: Ceres (dwarf planet)
+
+Ceres (2.77 AU (414 million km; 257 million mi) from the Sun) is the largest asteroid, a protoplanet, and a dwarf planet.[g] It has a diameter of slightly under 1,000 km (620 mi) and a mass large enough for its own gravity to pull it into a spherical shape. Ceres was considered a planet when it was discovered in 1801, but as further observations revealed additional asteroids, it became common to consider it as one of the minor rather than major planets.[119] It was then reclassified again as a dwarf planet in 2006 when the IAU definition of planet was established.[120]: 218
+Pallas and Vesta
+Main articles: 2 Pallas and 4 Vesta
+
+Pallas (2.77 AU from the Sun) and Vesta (2.36 AU from the Sun) are the largest asteroids in the asteroid belt, after Ceres. They are the other two protoplanets that survive more or less intact. At about 520 km (320 mi) in diameter, they were large enough to have developed planetary geology in the past, but both have suffered large impacts and been battered out of being round.[121][122][123] Fragments from impacts upon these two bodies survive elsewhere in the asteroid belt, as the Pallas family and Vesta family. Both were considered planets upon their discoveries in 1802 and 1807 respectively, and like Ceres, eventually considered minor planets with the discovery of more asteroids. Some authors today have begun to consider Pallas and Vesta as planets again, along with Ceres, under geophysical definitions of the term.[108]
+Asteroid groups
+
+Asteroids in the asteroid belt are divided into asteroid groups and families based on their orbital characteristics. Kirkwood gaps are sharp dips in the distribution of asteroid orbits that correspond to orbital resonances with Jupiter.[124] Asteroid moons are asteroids that orbit larger asteroids. They are not as clearly distinguished as planetary moons, sometimes being almost as large as their partners (e.g. that of 90 Antiope). The asteroid belt includes main-belt comets, which may have been the source of Earth's water.[125]
+
+Jupiter trojans are located in either of Jupiter's L4 or L5 points (gravitationally stable regions leading and trailing a planet in its orbit); the term trojan is also used for small bodies in any other planetary or satellite Lagrange point. Hilda asteroids are in a 2:3 resonance with Jupiter; that is, they go around the Sun three times for every two Jupiter orbits.[126] The inner Solar System contains near-Earth asteroids, many of which cross the orbits of the inner planets.[127] Some of them are potentially hazardous objects.[128]
+Outer Solar System
+Plot of objects around the Kuiper belt and other asteroid populations, the J, S, U and N denotes Jupiter, Saturn, Uranus and Neptune
+
+The outer region of the Solar System is home to the giant planets and their large moons. The centaurs and many short-period comets also orbit in this region. Due to their greater distance from the Sun, the solid objects in the outer Solar System contain a higher proportion of volatiles, such as water, ammonia, and methane than those of the inner Solar System because the lower temperatures allow these compounds to remain solid, without significant rates of sublimation.[10]
+Outer planets
+Main article: Giant planet
+The outer planets Jupiter, Saturn, Uranus and Neptune, compared to the inner planets Earth, Venus, Mars, and Mercury at the bottom right
+
+The four outer planets, also called giant planets or Jovian planets, collectively make up 99% of the mass known to orbit the Sun.[f] Jupiter and Saturn are together more than 400 times the mass of Earth and consist overwhelmingly of the gases hydrogen and helium, hence their designation as gas giants.[129] Uranus and Neptune are far less massive—less than 20 Earth masses (MEarth) each—and are composed primarily of ice. For these reasons, some astronomers suggest they belong in their own category, ice giants.[130] All four giant planets have rings, although only Saturn's ring system is easily observed from Earth. The term superior planet designates planets outside Earth's orbit and thus includes both the outer planets and Mars.[84]
+
+The ring–moon systems of Jupiter, Saturn, and Uranus are like miniature versions of the Solar System; that of Neptune is significantly different, having been disrupted by the capture of its largest moon Triton.[131]
+Jupiter
+Main article: Jupiter
+
+Jupiter (4.951–5.457 AU (740.7–816.4 million km; 460.2–507.3 million mi) from the Sun[85]), at 318 MEarth, is 2.5 times the mass of all the other planets put together. It is composed largely of hydrogen and helium. Jupiter's strong internal heat creates semi-permanent features in its atmosphere, such as cloud bands and the Great Red Spot. The planet possesses a 4.2–14 Gauss strength magnetosphere that spans 22–29 million km, making it, in certain respects, the largest object in the Solar System.[132] Jupiter has 95 known satellites. The four largest, Ganymede, Callisto, Io, and Europa, are called the Galilean moons: they show similarities to the terrestrial planets, such as volcanism and internal heating.[133] Ganymede, the largest satellite in the Solar System, is larger than Mercury; Callisto is almost as large.[134]
+Saturn
+Main article: Saturn
+
+Saturn (9.075–10.07 AU (1.3576–1.5065 billion km; 843.6–936.1 million mi) from the Sun[85]), distinguished by its extensive ring system, has several similarities to Jupiter, such as its atmospheric composition and magnetosphere. Although Saturn has 60% of Jupiter's volume, it is less than a third as massive, at 95 MEarth. Saturn is the only planet of the Solar System that is less dense than water. The rings of Saturn are made up of small ice and rock particles.[135] Saturn has 145 confirmed satellites composed largely of ice. Two of these, Titan and Enceladus, show signs of geological activity;[136] they, as well as five other Saturnian moons (Iapetus, Rhea, Dione, Tethys, and Mimas), are large enough to be round. Titan, the second-largest moon in the Solar System, is bigger than Mercury and the only satellite in the Solar System to have a substantial atmosphere.[137][138]
+Uranus
+Main article: Uranus
+
+Uranus (18.27–20.06 AU (2.733–3.001 billion km; 1.698–1.865 billion mi) from the Sun[85]), at 14 MEarth, has the lowest mass of the outer planets. Uniquely among the planets, it orbits the Sun on its side; its axial tilt is over ninety degrees to the ecliptic. This gives the planet extreme seasonal variation as each pole points toward and then away from the Sun.[139] It has a much colder core than the other giant planets and radiates very little heat into space.[140] As a consequence, it has the coldest planetary atmosphere in the Solar System.[141] Uranus has 27 known satellites, the largest ones being Titania, Oberon, Umbriel, Ariel, and Miranda.[142] Like the other giant planets, it possesses a ring system and magnetosphere.[143]
+Neptune
+Main article: Neptune
+
+Neptune (29.89–30.47 AU (4.471–4.558 billion km; 2.778–2.832 billion mi) from the Sun[85]), though slightly smaller than Uranus, is more massive (17 MEarth) and hence more dense. It radiates more internal heat than Uranus, but not as much as Jupiter or Saturn.[144] Neptune has 14 known satellites. The largest, Triton, is geologically active, with geysers of liquid nitrogen.[145] Triton is the only large satellite with a retrograde orbit, which indicates that it did not form with Neptune, but was probably captured from the Kuiper belt.[146] Neptune is accompanied in its orbit by several minor planets, termed Neptune trojans, that either lead or trail the planet by about one-sixth of the way around the Sun, positions known as Lagrange points.[147]
+Centaurs
+Main article: Centaur (small Solar System body)
+
+The centaurs are icy comet-like bodies whose orbits have semi-major axes greater than Jupiter's (5.5 AU (820 million km; 510 million mi)) and less than Neptune's (30 AU (4.5 billion km; 2.8 billion mi)). These are former Kuiper belt and scattered disc objects that were gravitationally perturbed closer to the Sun by the outer planets, and are expected to become comets or get ejected out of the Solar System.[39] While most centaurs are inactive and asteroid-like, some exhibit clear cometary activity, such as the first centaur discovered, 2060 Chiron, which has been classified as a comet (95P) because it develops a coma just as comets do when they approach the Sun.[148] The largest known centaur, 10199 Chariklo, has a diameter of about 250 km (160 mi) and is one of the only few minor planets known to possess a ring system.[149][150]
+Comets
+Main article: Comet
+Comet Hale–Bopp seen in 1997
+
+Comets are small Solar System bodies,[g] typically only a few kilometres across, composed largely of volatile ices. They have highly eccentric orbits, generally a perihelion within the orbits of the inner planets and an aphelion far beyond Pluto. When a comet enters the inner Solar System, its proximity to the Sun causes its icy surface to sublimate and ionise, creating a coma: a long tail of gas and dust often visible to the naked eye.[151]
+
+Short-period comets have orbits lasting less than two hundred years. Long-period comets have orbits lasting thousands of years. Short-period comets are thought to originate in the Kuiper belt, whereas long-period comets, such as Hale–Bopp, are thought to originate in the Oort cloud. Many comet groups, such as the Kreutz sungrazers, formed from the breakup of a single parent.[152] Some comets with hyperbolic orbits may originate outside the Solar System, but determining their precise orbits is difficult.[153] Old comets whose volatiles have mostly been driven out by solar warming are often categorised as asteroids.[154]
+Trans-Neptunian region
+Distribution and size of trans-Neptunian objects. The horizontal axis stand for the semi-major axis of the body, the vertical axis stands for the inclination of the orbit, and the size of the circle stands for the relative size of the object.
+Size comparison of some large TNOs with Earth: Pluto and its moons, Eris, Makemake, Haumea, Sedna, Gonggong, Quaoar, Orcus, Salacia, and 2002 MS4.
+
+Beyond the orbit of Neptune lies the area of the "trans-Neptunian region", with the doughnut-shaped Kuiper belt, home of Pluto and several other dwarf planets, and an overlapping disc of scattered objects, which is tilted toward the plane of the Solar System and reaches much further out than the Kuiper belt. The entire region is still largely unexplored. It appears to consist overwhelmingly of many thousands of small worlds—the largest having a diameter only a fifth that of Earth and a mass far smaller than that of the Moon—composed mainly of rock and ice. This region is sometimes described as the "third zone of the Solar System", enclosing the inner and the outer Solar System.[155]
+Kuiper belt
+Main article: Kuiper belt
+
+The Kuiper belt is a great ring of debris similar to the asteroid belt, but consisting mainly of objects composed primarily of ice.[156] It extends between 30 and 50 AU (4.5 and 7.5 billion km; 2.8 and 4.6 billion mi) from the Sun. It is composed mainly of small Solar System bodies, although the largest few are probably large enough to be dwarf planets.[157] There are estimated to be over 100,000 Kuiper belt objects with a diameter greater than 50 km (30 mi), but the total mass of the Kuiper belt is thought to be only a tenth or even a hundredth the mass of Earth.[39] Many Kuiper belt objects have satellites,[158] and most have orbits that are substantially inclined (~10°) to the plane of the ecliptic.[159]
+
+The Kuiper belt can be roughly divided into the "classical" belt and the resonant trans-Neptunian objects.[156] The latter have orbits whose periods are in a simple ratio to that of Neptune: for example, going around the Sun twice for every three times that Neptune does, or once for every two. The classical belt consists of objects having no resonance with Neptune, and extends from roughly 39.4 to 47.7 AU (5.89 to 7.14 billion km; 3.66 to 4.43 billion mi).[160] Members of the classical Kuiper belt are sometimes called "cubewanos", after the first of their kind to be discovered, originally designated 1992 QB1; they are still in near primordial, low-eccentricity orbits.[161]
+Pluto and Charon
+Main articles: Pluto and Charon (moon)
+
+The dwarf planet Pluto (with an average orbit of 39 AU (5.8 billion km; 3.6 billion mi) from the Sun) is the largest known object in the Kuiper belt. When discovered in 1930, it was considered to be the ninth planet; this changed in 2006 with the adoption of a formal definition of planet. Pluto has a relatively eccentric orbit inclined 17 degrees to the ecliptic plane and ranging from 29.7 AU (4.44 billion km; 2.76 billion mi) from the Sun at perihelion (within the orbit of Neptune) to 49.5 AU (7.41 billion km; 4.60 billion mi) at aphelion. Pluto has a 2:3 resonance with Neptune, meaning that Pluto orbits twice round the Sun for every three Neptunian orbits. Kuiper belt objects whose orbits share this resonance are called plutinos.[162]
+
+Charon, the largest of Pluto's moons, is sometimes described as part of a binary system with Pluto, as the two bodies orbit a barycenter of gravity above their surfaces (i.e. they appear to "orbit each other"). Beyond Charon, four much smaller moons, Styx, Nix, Kerberos, and Hydra, orbit Pluto.[163]
+Others
+
+Besides Pluto, astronomers generally agree that at least four other Kuiper belt objects are dwarf planets,[157] though there is some doubt for Orcus,[164] and additional bodies have also been proposed:[165]
+
+    Makemake (45.79 AU average from the Sun), although smaller than Pluto, is the largest known object in the classical Kuiper belt (that is, a Kuiper belt object not in a confirmed resonance with Neptune). Makemake is the brightest object in the Kuiper belt after Pluto. Discovered in 2005, it was officially named in 2009.[166] Its orbit is far more inclined than Pluto's, at 29°.[167] It has one known moon.[168]
+    Haumea (43.13 AU average from the Sun) is in an orbit similar to Makemake, except that it is in a temporary 7:12 orbital resonance with Neptune.[169] Like Makemake, it was discovered in 2005.[170] Uniquely among the dwarf planets, Haumea possess a ring system, two known moons named Hiʻiaka and Namaka, and rotates so quickly (once every 3.9 hours) that it is stretched into an ellipsoid. It is part of a collisional family of Kuiper belt objects that share similar orbits, which suggests a giant collision took place on Haumea and ejected its fragments into space billions of years ago.[171]
+    Quaoar (43.69 AU average from the Sun) is the second-largest known object in the classical Kuiper belt, after Makemake. Its orbit is significantly less eccentric and inclined than those of Makemake or Haumea.[169] It possesses a ring system and one known moon, Weywot.[172]
+    Orcus (39.40 AU average from the Sun) is in the same 2:3 orbital resonance with Neptune as Pluto, and is the largest such object after Pluto itself.[169] Its eccentricity and inclination are similar to Pluto's, but its perihelion lies about 120° from that of Pluto. Thus, the phase of Orcus's orbit is opposite to Pluto's: Orcus is at aphelion (most recently in 2019) around when Pluto is at perihelion (most recently in 1989) and vice versa.[173] For this reason, it has been called the anti-Pluto.[174][175] It has one known moon, Vanth.[176]
+
+Scattered disc
+Main article: Scattered disc
+The orbital eccentricities and inclinations of the scattered disc population compared to the classical and resonant Kuiper belt objects
+
+The scattered disc, which overlaps the Kuiper belt but extends out to near 500 AU, is thought to be the source of short-period comets. Scattered-disc objects are believed to have been perturbed into erratic orbits by the gravitational influence of Neptune's early outward migration. Most scattered disc objects (SDOs) have perihelia within the Kuiper belt but aphelia far beyond it (some more than 150 AU from the Sun). SDOs' orbits can also be inclined up to 46.8° from the ecliptic plane.[177] Some astronomers consider the scattered disc to be merely another region of the Kuiper belt and describe scattered-disc objects as "scattered Kuiper belt objects".[178] Some astronomers also classify centaurs as inward-scattered Kuiper belt objects along with the outward-scattered residents of the scattered disc.[179]
+Eris and Gonggong
+
+Eris (67.78 AU average from the Sun) is the largest known scattered disc object, and caused a debate about what constitutes a planet, because it is 25% more massive than Pluto[180] and about the same diameter. It is the most massive of the known dwarf planets. It has one known moon, Dysnomia. Like Pluto, its orbit is highly eccentric, with a perihelion of 38.2 AU (roughly Pluto's distance from the Sun) and an aphelion of 97.6 AU, and steeply inclined to the ecliptic plane at an angle of 44°.[181]
+
+Gonggong (67.38 AU average from the Sun) is another dwarf planet in a comparable orbit to Eris, except that it is in a 3:10 resonance with Neptune.[182] It has one known moon, Xiangliu.[183]
+Farthest regions
+
+The point at which the Solar System ends and interstellar space begins is not precisely defined because its outer boundaries are shaped by two forces: the solar wind and the Sun's gravity. The limit of the solar wind's influence is roughly four times Pluto's distance from the Sun; this heliopause, the outer boundary of the heliosphere, is considered the beginning of the interstellar medium.[60] The Sun's Hill sphere, the effective range of its gravitational dominance, is thought to extend up to a thousand times farther and encompasses the hypothetical Oort cloud.[184]
+Edge of the heliosphere
+Main article: Heliosheath
+Artistic depiction of the Solar System's heliosphere
+
+The Sun's stellar-wind bubble, the heliosphere, a region of space dominated by the Sun, has its boundary at the termination shock, which is roughly 80–100 AU from the Sun upwind of the interstellar medium and roughly 200 AU from the Sun downwind.[185] Here the solar wind collides with the interstellar medium[186] and dramatically slows, condenses and becomes more turbulent,[185] forming a great oval structure known as the heliosheath. This structure has been theorized to look and behave very much like a comet's tail, extending outward for a further 40 AU on the upwind side but tailing many times that distance downwind.[187] Evidence from the Cassini and Interstellar Boundary Explorer spacecraft has suggested that it is forced into a bubble shape by the constraining action of the interstellar magnetic field,[188][189] but the actual shape remains unknown.[190]
+
+The outer boundary of the heliosphere, the heliopause, is the point at which the solar wind finally terminates and is the beginning of interstellar space.[60] Voyager 1 and Voyager 2 passed the termination shock and entered the heliosheath at 94 and 84 AU from the Sun, respectively.[191][192] Voyager 1 was reported to have crossed the heliopause in August 2012, and Voyager 2 in December 2018.[193][194]
+
+The shape and form of the outer edge of the heliosphere is likely affected by the fluid dynamics of interactions with the interstellar medium as well as solar magnetic fields prevailing to the south, e.g. it is bluntly shaped with the northern hemisphere extending 9 AU farther than the southern hemisphere.[185] Beyond the heliopause, at around 230 AU, lies the bow shock: a plasma "wake" left by the Sun as it travels through the Milky Way.[195]
+Detached objects
+The detached object Sedna and its orbit within the Solar System
+Main articles: Detached object and Sednoid
+
+Sedna (with an average orbit of 520 AU from the Sun) is a large, reddish object with a gigantic, highly elliptical orbit that takes it from about 76 AU at perihelion to 940 AU at aphelion and takes 11,400 years to complete. Mike Brown, who discovered the object in 2003, asserts that it cannot be part of the scattered disc or the Kuiper belt because its perihelion is too distant to have been affected by Neptune's migration. He and other astronomers consider it to be the first in an entirely new population, sometimes termed "distant detached objects" (DDOs), which also may include the object 2000 CR105, which has a perihelion of 45 AU, an aphelion of 415 AU, and an orbital period of 3,420 years.[196] Brown terms this population the "inner Oort cloud" because it may have formed through a similar process, although it is far closer to the Sun.[197] Sedna is very likely a dwarf planet, though its shape has yet to be determined. The second unequivocally detached object, with a perihelion farther than Sedna's at roughly 81 AU, is 2012 VP113, discovered in 2012. Its aphelion is only about half that of Sedna's, at 458 AU.[198][199]
+Oort cloud
+Main article: Oort cloud
+
+The Oort cloud is a hypothetical spherical cloud of up to a trillion icy objects that is thought to be the source for all long-period comets and to surround the Solar System at roughly 50,000 AU (around 1 light-year (ly)) from the Sun, and possibly to as far as 100,000 AU (1.87 ly). It is thought to be composed of comets that were ejected from the inner Solar System by gravitational interactions with the outer planets. Oort cloud objects move very slowly, and can be perturbed by infrequent events, such as collisions, the gravitational effects of a passing star, or the galactic tide, the tidal force exerted by the Milky Way.[200][201]
+Boundaries
+See also: Planets beyond Neptune, Planet Nine, and List of Solar System objects by greatest aphelion
+
+Much of the Solar System is still unknown. The Sun's gravitational field is estimated to dominate the gravitational forces of surrounding stars out to about two light-years (125,000 AU). Lower estimates for the radius of the Oort cloud, by contrast, do not place it farther than 50,000 AU.[202] Most of the mass is orbiting in the region between 3,000 and 100,000 AU.[203] Despite discoveries such as Sedna, the region between the Kuiper belt and the Oort cloud, an area tens of thousands of AU in radius, is still virtually unmapped. Learning about this region of space is difficult, because it depends upon inferences from those few objects whose orbits happen to be perturbed such that they fall closer to the Sun, and even then, detecting these objects has often been possible only when they happened to become bright enough to register as comets.[204] Objects may yet be discovered in the Solar System's uncharted regions.[205] The furthest known objects, such as Comet West, have aphelia around 70,000 AU from the Sun.[206]
+Comparison with other star systems
+1e, 1f and 1g is in the habitable zone
+Habitable zones of TRAPPIST-1 and the Solar System; here, the TRAPPIST-1 system is enlarged 25 times. The displayed planetary surfaces on TRAPPIST-1 are speculative.
+
+Compared to many extrasolar systems, the Solar System stands out in lacking planets interior to the orbit of Mercury.[207][208] The known Solar System also lacks super-Earths, planets between one and ten times as massive as the Earth,[207] although the hypothetical Planet Nine, if it does exist, could be a super-Earth orbiting in the outer Solar System.[209] Uncommonly, it has only small rocky planets and large gas giants; elsewhere planets of intermediate size are typical—both rocky and gas—so there is no "gap" as seen between the size of Earth and of Neptune (with a radius 3.8 times as large). As many of these super-Earths are closer to their respective stars than Mercury is to the Sun, a hypothesis has arisen that all planetary systems start with many close-in planets, and that typically a sequence of their collisions causes consolidation of mass into few larger planets, but in case of the Solar System the collisions caused their destruction and ejection.[207][210]
+
+The orbits of Solar System planets are nearly circular. Compared to other systems, they have smaller orbital eccentricity.[207] Although there are attempts to explain it partly with a bias in the radial-velocity detection method and partly with long interactions of a quite high number of planets, the exact causes remain undetermined.[207][211]
+Location
+Celestial neighborhood
+Diagram of the Local Interstellar Cloud, the G-Cloud and surrounding stars. As of 2022, the precise location of the Solar System in the clouds is an open question in astronomy.[212]
+
+The Solar System is surrounded by the Local Interstellar Cloud, although it is not clear if it is embedded in the Local Interstellar Cloud or if it lies just outside the cloud's edge.[213][214] Multiple other interstellar clouds also exist in the region within 300 light-years of the Sun, known as the Local Bubble.[214] The latter feature is an hourglass-shaped cavity or superbubble in the interstellar medium roughly 300 light-years across. The bubble is suffused with high-temperature plasma, suggesting that it may be the product of several recent supernovae.[215]
+
+The Local Bubble is a small superbubble compared to the neighboring wider Radcliffe Wave and Split linear structures (formerly Gould Belt), each of which are some thousands of light-years in length.[216] All these structures are part of the Orion Arm, which contains most of the stars in the Milky Way that are visible to the unaided eye. The density of all matter in the local neighborhood is 0.097±0.013 M☉·pc−3.[217]
+
+Within ten light-years of the Sun there are relatively few stars, the closest being the triple star system Alpha Centauri, which is about 4.4 light-years away and may be in the Local Bubble's G-Cloud.[218] Alpha Centauri A and B are a closely tied pair of Sun-like stars, whereas the closest star to Earth, the small red dwarf Proxima Centauri, orbits the pair at a distance of 0.2 light-year. In 2016, a potentially habitable exoplanet was found to be orbiting Proxima Centauri, called Proxima Centauri b, the closest confirmed exoplanet to the Sun.[219]
+
+The next closest known fusors to the Sun are the red dwarfs Barnard's Star (at 5.9 ly), Wolf 359 (7.8 ly), and Lalande 21185 (8.3 ly).[220] The nearest brown dwarfs belong to the binary Luhman 16 system (6.6 ly), and the closest known rogue or free-floating planetary-mass object at less than 10 Jupiter masses is the sub-brown dwarf WISE 0855−0714 (7.4 ly).[221]
+
+Just beyond at 8.6 ly lies Sirius, the brightest star in Earth's night sky, with roughly twice the Sun's mass, orbited by the closest white dwarf to Earth, Sirius B. Other stars within ten light-years are the binary red-dwarf system Gliese 65 (8.7 ly) and the solitary red dwarf Ross 154 (9.7 ly).[222][223] The closest solitary Sun-like star to the Solar System is Tau Ceti at 11.9 light-years. It has roughly 80% of the Sun's mass but only about half of its luminosity.[224]
+
+The nearest and unaided-visible group of stars beyond the immediate celestial neighborhood is the Ursa Major moving group at roughly 80 light-years, which is within the Local Bubble, like the nearest as well as unaided-visible star cluster the Hyades, which lie at its edge. The closest star-forming regions are the Corona Australis Molecular Cloud, the Rho Ophiuchi cloud complex and the Taurus molecular cloud; the latter lies just beyond the Local Bubble and is part of the Radcliffe wave.[225]
+Galactic position and orbit
+See also: Location of Earth, Galactic year, and Orbit of the Sun
+Diagram of the Milky Way, with galactic features and the relative position of the Solar System labelled.
+
+The Solar System is located in the Milky Way, a barred spiral galaxy with a diameter of about 100,000 light-years containing more than 100 billion stars.[226] The Sun is part of one of the Milky Way's outer spiral arms, known as the Orion–Cygnus Arm or Local Spur.[227]
+
+The Sun orbits close to circular the Galactic Center (where the supermassive black hole Sagittarius A* resides) at a distance of 26,660 light-years,[228] orbiting at roughly the same speed as that of the spiral arms.[229][230] Therefore, the Sun passes through arms only rarely.
+
+Its speed around the center of the Milky Way is about 220 km/s, so that it completes one revolution every 240 million years.[226] This revolution is known as the Solar System's galactic year.[231] The solar apex, the direction of the Sun's path through interstellar space, is near the constellation Hercules in the direction of the current location of the bright star Vega.[232] The plane of the ecliptic lies at an angle of about 60° to the galactic plane.[h]
+Habitability of galactic position and orbit
+
+The Solar System's location in the Milky Way is a factor in the evolutionary history of life on Earth. Spiral arms are home to a far larger concentration of supernovae, gravitational instabilities, and radiation that could disrupt the Solar System, but since Earth stays in the Local Spur and therefore does not pass frequently through spiral arms, this has given Earth long periods of stability for life to evolve.[229] However, the changing position of the Solar System relative to other parts of the Milky Way could explain periodic extinction events on Earth, according to the Shiva hypothesis or related theories, but this remains controversial.[234][235]
+
+The Solar System lies well outside the star-crowded environs of the Galactic Center. Near the center, gravitational tugs from nearby stars could perturb bodies in the Oort cloud and send many comets into the inner Solar System, producing collisions with potentially catastrophic implications for life on Earth. The intense radiation of the Galactic Center could also interfere with the development of complex life.[229] Stellar flybys that pass within 0.8 light-years of the Sun occur roughly once every 100,000 years. The closest well-measured approach was Scholz's Star, which approached to 52+23
+−14 kAU of the Sun some 70+15
+−10 kya, likely passing through the outer Oort cloud.[236]
+Humanity's perspective
+Main article: Discovery and exploration of the Solar System
+The motion of 'lights' moving across the sky is the basis of the classical definition of planets: wandering stars.
+
+Humanity's knowledge of the Solar System has grown incrementally over the centuries. Up to the Late Middle Ages–Renaissance, astronomers from Europe to India believed Earth to be stationary at the center of the universe[237] and categorically different from the divine or ethereal objects that moved through the sky. Although the Greek philosopher Aristarchus of Samos had speculated on a heliocentric reordering of the cosmos, Nicolaus Copernicus was the first person known to have developed a mathematically predictive heliocentric system.[238][239] Heliocentrism did not triumph immediately over geocentrism, but the work of Copernicus had its champions, notably Johannes Kepler. Using a heliocentric model that improved upon Copernicus by allowing orbits to be elliptical, and the precise observational data of Tycho Brahe, Kepler produced the Rudolphine Tables, which enabled accurate computations of the positions of the then-known planets. Pierre Gassendi used them to predict a transit of Mercury in 1631, and Jeremiah Horrocks did the same for a transit of Venus in 1639. This provided a strong vindication of heliocentrism and Kepler's elliptical orbits.[240][241]
+
+In the 17th century, Galileo publicized the use of the telescope in astronomy; he and Simon Marius independently discovered that Jupiter had four satellites in orbit around it.[242] Christiaan Huygens followed on from these observations by discovering Saturn's moon Titan and the shape of the rings of Saturn.[243] In 1677, Edmond Halley observed a transit of Mercury across the Sun, leading him to realize that observations of the solar parallax of a planet (more ideally using the transit of Venus) could be used to trigonometrically determine the distances between Earth, Venus, and the Sun.[244] Halley's friend Isaac Newton, in his magisterial Principia Mathematica of 1687, demonstrated that celestial bodies are not quintessentially different from Earthly ones: the same laws of motion and of gravity apply on Earth and in the skies.[29]: 142
+
+The term "Solar System" entered the English language by 1704, when John Locke used it to refer to the Sun, planets, and comets.[245] In 1705, Halley realized that repeated sightings of a comet were of the same object, returning regularly once every 75–76 years. This was the first evidence that anything other than the planets repeatedly orbited the Sun,[246] though Seneca had theorized this about comets in the 1st century.[247] Careful observations of the 1769 transit of Venus allowed astronomers to calculate the average Earth–Sun distance as 93,726,900 miles (150,838,800 km), only 0.8% greater than the modern value.[248] Uranus, having occasionally been observed since antiquity, was recognized to be a planet orbiting beyond Saturn by 1783.[249] In 1838, Friedrich Bessel successfully measured a stellar parallax, an apparent shift in the position of a star created by Earth's motion around the Sun, providing the first direct, experimental proof of heliocentrism.[250] Neptune was identified as a planet some years later, in 1846, thanks to its gravitational pull causing a slight but detectable variation in the orbit of Uranus.[251]
+
+In the 20th century, humans began their space exploration around the Solar System, starting with placing telescopes in space.[252] Since then, humans have landed on the Moon during the Apollo program; the Apollo 13 mission marked the furthest any human has been away from Earth at 400,171 kilometers (248,655 mi).[253] All eight planets and two dwarf planets have been visited by space probes. This began with Mariner 2's fly-by of Venus in 1962, while Mariner 9 mission to Mars was the first to orbit another planet in 1971. The outer planets were first visited by Pioneer 10's encounter with Jupiter, and Pioneer 11's encounter with Saturn. The remaining gas giants were first visited by the Voyager spacecraft, one of which (Voyager 1) is the furthest object made by humankind and the first in interstellar space.[254] In addition, probes have also returned samples from comets[255] and asteroids,[256] as well as flown through the Sun's corona[257] and made fly-bys of Kuiper belt objects.[258] Six of the planets (all but Uranus and Neptune) have or had a dedicated orbiter.[259]
+See also
+
+Quantum mechanics is a fundamental theory in physics that describes the behavior of nature at the scale of atoms and subatomic particles.[2]: 1.1  It is the foundation of all quantum physics including quantum chemistry, quantum field theory, quantum technology, and quantum information science.
+
+Classical physics, the collection of theories that existed before the advent of quantum mechanics, describes many aspects of nature at an ordinary (macroscopic) scale, but is not sufficient for describing them at small (atomic and subatomic) scales. Most theories in classical physics can be derived from quantum mechanics as an approximation valid at large (macroscopic) scale.[3]
+
+Quantum mechanics differs from classical physics in that energy, momentum, angular momentum, and other quantities of a bound system are restricted to discrete values (quantization); measurements of systems show characteristics of both particles and waves (wave–particle duality); and there are limits to how accurately the value of a physical quantity can be predicted prior to its measurement, given a complete set of initial conditions (the uncertainty principle).
+
+Quantum mechanics arose gradually from theories to explain observations that could not be reconciled with classical physics, such as Max Planck's solution in 1900 to the black-body radiation problem, and the correspondence between energy and frequency in Albert Einstein's 1905 paper, which explained the photoelectric effect. These early attempts to understand microscopic phenomena, now known as the "old quantum theory", led to the full development of quantum mechanics in the mid-1920s by Niels Bohr, Erwin Schrödinger, Werner Heisenberg, Max Born, Paul Dirac and others. The modern theory is formulated in various specially developed mathematical formalisms. In one of them, a mathematical entity called the wave function provides information, in the form of probability amplitudes, about what measurements of a particle's energy, momentum, and other physical properties may yield.
+Overview and fundamental concepts
+
+Quantum mechanics allows the calculation of properties and behaviour of physical systems. It is typically applied to microscopic systems: molecules, atoms and sub-atomic particles. It has been demonstrated to hold for complex molecules with thousands of atoms,[4] but its application to human beings raises philosophical problems, such as Wigner's friend, and its application to the universe as a whole remains speculative.[5] Predictions of quantum mechanics have been verified experimentally to an extremely high degree of accuracy. For example, the refinement of quantum mechanics for the interaction of light and matter, known as quantum electrodynamics (QED), has been shown to agree with experiment to within 1 part in 108 for some atomic properties.
+
+A fundamental feature of the theory is that it usually cannot predict with certainty what will happen, but only give probabilities. Mathematically, a probability is found by taking the square of the absolute value of a complex number, known as a probability amplitude. This is known as the Born rule, named after physicist Max Born. For example, a quantum particle like an electron can be described by a wave function, which associates to each point in space a probability amplitude. Applying the Born rule to these amplitudes gives a probability density function for the position that the electron will be found to have when an experiment is performed to measure it. This is the best the theory can do; it cannot say for certain where the electron will be found. The Schrödinger equation relates the collection of probability amplitudes that pertain to one moment of time to the collection of probability amplitudes that pertain to another.
+
+One consequence of the mathematical rules of quantum mechanics is a tradeoff in predictability between different measurable quantities. The most famous form of this uncertainty principle says that no matter how a quantum particle is prepared or how carefully experiments upon it are arranged, it is impossible to have a precise prediction for a measurement of its position and also at the same time for a measurement of its momentum.
+
+Another consequence of the mathematical rules of quantum mechanics is the phenomenon of quantum interference, which is often illustrated with the double-slit experiment. In the basic version of this experiment, a coherent light source, such as a laser beam, illuminates a plate pierced by two parallel slits, and the light passing through the slits is observed on a screen behind the plate.[6]: 102–111 [2]: 1.1–1.8  The wave nature of light causes the light waves passing through the two slits to interfere, producing bright and dark bands on the screen – a result that would not be expected if light consisted of classical particles.[6] However, the light is always found to be absorbed at the screen at discrete points, as individual particles rather than waves; the interference pattern appears via the varying density of these particle hits on the screen. Furthermore, versions of the experiment that include detectors at the slits find that each detected photon passes through one slit (as would a classical particle), and not through both slits (as would a wave).[6]: 109 [7][8] However, such experiments demonstrate that particles do not form the interference pattern if one detects which slit they pass through. This behavior is known as wave–particle duality. In addition to light, electrons, atoms, and molecules are all found to exhibit the same dual behavior when fired towards a double slit.[2]
+
+Another non-classical phenomenon predicted by quantum mechanics is quantum tunnelling: a particle that goes up against a potential barrier can cross it, even if its kinetic energy is smaller than the maximum of the potential.[9] In classical mechanics this particle would be trapped. Quantum tunnelling has several important consequences, enabling radioactive decay, nuclear fusion in stars, and applications such as scanning tunnelling microscopy and the tunnel diode.[10]
+
+When quantum systems interact, the result can be the creation of quantum entanglement: their properties become so intertwined that a description of the whole solely in terms of the individual parts is no longer possible. Erwin Schrödinger called entanglement "...the characteristic trait of quantum mechanics, the one that enforces its entire departure from classical lines of thought".[11] Quantum entanglement enables quantum computing and is part of quantum communication protocols, such as quantum key distribution and superdense coding.[12] Contrary to popular misconception, entanglement does not allow sending signals faster than light, as demonstrated by the no-communication theorem.[12]
+
+Another possibility opened by entanglement is testing for "hidden variables", hypothetical properties more fundamental than the quantities addressed in quantum theory itself, knowledge of which would allow more exact predictions than quantum theory can provide. A collection of results, most significantly Bell's theorem, have demonstrated that broad classes of such hidden-variable theories are in fact incompatible with quantum physics. According to Bell's theorem, if nature actually operates in accord with any theory of local hidden variables, then the results of a Bell test will be constrained in a particular, quantifiable way. Many Bell tests have been performed, using entangled particles, and they have shown results incompatible with the constraints imposed by local hidden variables.[13][14]
+
+It is not possible to present these concepts in more than a superficial way without introducing the actual mathematics involved; understanding quantum mechanics requires not only manipulating complex numbers, but also linear algebra, differential equations, group theory, and other more advanced subjects.[note 1] Accordingly, this article will present a mathematical formulation of quantum mechanics and survey its application to some useful and oft-studied examples.
+Mathematical formulation
+Main article: Mathematical formulation of quantum mechanics
+
+In the mathematically rigorous formulation of quantum mechanics, the state of a quantum mechanical system is a vector ψ \psi belonging to a (separable) complex Hilbert space H {\mathcal {H}}. This vector is postulated to be normalized under the Hilbert space inner product, that is, it obeys ⟨ ψ , ψ ⟩ = 1 {\displaystyle \langle \psi ,\psi \rangle =1}, and it is well-defined up to a complex number of modulus 1 (the global phase), that is, ψ \psi and e i α ψ {\displaystyle e^{i\alpha }\psi } represent the same physical system. In other words, the possible states are points in the projective space of a Hilbert space, usually called the complex projective space. The exact nature of this Hilbert space is dependent on the system – for example, for describing position and momentum the Hilbert space is the space of complex square-integrable functions L 2 ( C ) {\displaystyle L^{2}(\mathbb {C} )}, while the Hilbert space for the spin of a single proton is simply the space of two-dimensional complex vectors C 2 {\mathbb C}^{2} with the usual inner product.
+
+Physical quantities of interest – position, momentum, energy, spin – are represented by observables, which are Hermitian (more precisely, self-adjoint) linear operators acting on the Hilbert space. A quantum state can be an eigenvector of an observable, in which case it is called an eigenstate, and the associated eigenvalue corresponds to the value of the observable in that eigenstate. More generally, a quantum state will be a linear combination of the eigenstates, known as a quantum superposition. When an observable is measured, the result will be one of its eigenvalues with probability given by the Born rule: in the simplest case the eigenvalue λ \lambda is non-degenerate and the probability is given by | ⟨ λ → , ψ ⟩ | 2 {\displaystyle |\langle {\vec {\lambda }},\psi \rangle |^{2}}, where λ → {\displaystyle {\vec {\lambda }}} is its associated eigenvector. More generally, the eigenvalue is degenerate and the probability is given by ⟨ ψ , P λ ψ ⟩ {\displaystyle \langle \psi ,P_{\lambda }\psi \rangle }, where P λ P_{\lambda } is the projector onto its associated eigenspace. In the continuous case, these formulas give instead the probability density.
+
+After the measurement, if result λ \lambda was obtained, the quantum state is postulated to collapse to λ → {\displaystyle {\vec {\lambda }}}, in the non-degenerate case, or to P λ ψ / ⟨ ψ , P λ ψ ⟩ {\textstyle P_{\lambda }\psi {\big /}\!{\sqrt {\langle \psi ,P_{\lambda }\psi \rangle }}}, in the general case. The probabilistic nature of quantum mechanics thus stems from the act of measurement. This is one of the most difficult aspects of quantum systems to understand. It was the central topic in the famous Bohr–Einstein debates, in which the two scientists attempted to clarify these fundamental principles by way of thought experiments. In the decades after the formulation of quantum mechanics, the question of what constitutes a "measurement" has been extensively studied. Newer interpretations of quantum mechanics have been formulated that do away with the concept of "wave function collapse" (see, for example, the many-worlds interpretation). The basic idea is that when a quantum system interacts with a measuring apparatus, their respective wave functions become entangled so that the original quantum system ceases to exist as an independent entity. For details, see the article on measurement in quantum mechanics.[17]
+
+The time evolution of a quantum state is described by the Schrödinger equation:
+
+    i ℏ d d t ψ ( t ) = H ψ ( t ) . {\displaystyle i\hbar {\frac {d}{dt}}\psi (t)=H\psi (t).}
+
+Here H H denotes the Hamiltonian, the observable corresponding to the total energy of the system, and ℏ \hbar is the reduced Planck constant. The constant i ℏ i\hbar is introduced so that the Hamiltonian is reduced to the classical Hamiltonian in cases where the quantum system can be approximated by a classical system; the ability to make such an approximation in certain limits is called the correspondence principle.
+
+The solution of this differential equation is given by
+
+    ψ ( t ) = e − i H t / ℏ ψ ( 0 ) . {\displaystyle \psi (t)=e^{-iHt/\hbar }\psi (0).}
+
+The operator U ( t ) = e − i H t / ℏ {\displaystyle U(t)=e^{-iHt/\hbar }} is known as the time-evolution operator, and has the crucial property that it is unitary. This time evolution is deterministic in the sense that – given an initial quantum state ψ ( 0 ) \psi (0)  – it makes a definite prediction of what the quantum state ψ ( t ) \psi(t) will be at any later time.[18]
+Fig. 1: Probability densities corresponding to the wave functions of an electron in a hydrogen atom possessing definite energy levels (increasing from the top of the image to the bottom: n = 1, 2, 3, ...) and angular momenta (increasing across from left to right: s, p, d, ...). Denser areas correspond to higher probability density in a position measurement. Such wave functions are directly comparable to Chladni's figures of acoustic modes of vibration in classical physics and are modes of oscillation as well, possessing a sharp energy and thus, a definite frequency. The angular momentum and energy are quantized and take only discrete values like those shown. (As is the case for resonant frequencies in acoustics.)
+
+Some wave functions produce probability distributions that are independent of time, such as eigenstates of the Hamiltonian. Many systems that are treated dynamically in classical mechanics are described by such "static" wave functions. For example, a single electron in an unexcited atom is pictured classically as a particle moving in a circular trajectory around the atomic nucleus, whereas in quantum mechanics, it is described by a static wave function surrounding the nucleus. For example, the electron wave function for an unexcited hydrogen atom is a spherically symmetric function known as an s orbital (Fig. 1).
+
+Analytic solutions of the Schrödinger equation are known for very few relatively simple model Hamiltonians including the quantum harmonic oscillator, the particle in a box, the dihydrogen cation, and the hydrogen atom. Even the helium atom – which contains just two electrons – has defied all attempts at a fully analytic treatment.
+
+However, there are techniques for finding approximate solutions. One method, called perturbation theory, uses the analytic result for a simple quantum mechanical model to create a result for a related but more complicated model by (for example) the addition of a weak potential energy. Another method is called "semi-classical equation of motion", which applies to systems for which quantum mechanics produces only small deviations from classical behavior. These deviations can then be computed based on the classical motion. This approach is particularly important in the field of quantum chaos.
--- a/exllamav3/conversion/standard_cal_data/tiny.utf8
+++ b/exllamav3/conversion/standard_cal_data/tiny.utf8
--- a/exllamav3/conversion/standard_cal_data/wiki.utf8
+++ b/exllamav3/conversion/standard_cal_data/wiki.utf8
--- a/exllamav3/device.py
+++ b/exllamav3/device.py
@@ -0,0 +1,44 @@
+from __future__ import annotations
+import torch
+import torch.nn.functional as F
+from torch import nn
+from .constants import PAGE_SIZE
+from .models import Config
+
+device_contexts = {}
+
+class DeviceContext:
+
+    def __init__(
+        self,
+        config: Config,
+        device: torch.Device,
+    ):
+        self.reference_count = 0
+        self.device = device
+        self.config = config
+
+
+def get_key(
+    config: Config,
+    device: torch.Device,
+):
+    return f"{str(config.uuid)},{str(device)}"
+
+
+def get_device_context(config: Config, device: torch.device):
+    key = get_key(config, device)
+    if key not in device_contexts:
+        device_contexts[key] = DeviceContext(config, device)
+    dc = device_contexts[key]
+    dc.reference_count += 1
+    return dc
+
+
+def release_device_context(config: Config, device: torch.device):
+    key = get_key(config, device)
+    assert key in device_contexts
+    dc = device_contexts[key]
+    dc.reference_count -= 1
+    if dc.reference_count == 0:
+        del device_contexts[key]
--- a/exllamav3/exllamav3_ext/activation.cu
+++ b/exllamav3/exllamav3_ext/activation.cu
@@ -0,0 +1,117 @@
+#include "activation.cuh"
+#include <c10/cuda/CUDAGuard.h>
+#include <ATen/cuda/CUDAContext.h>
+#include <cuda_fp16.h>
+#include "util.h"
+#include "util.cuh"
+#include "compat.cuh"
+
+#define NUM_THREADS 256
+#define ACT_SILU 0
+#define ACT_GELU 1
+
+__device__ __forceinline__ half _silu(half x)
+{
+    half one = __float2half(1.0f);
+    half neg_x = __hneg(x);
+    half e = hexp(neg_x);
+    half sum = __hadd(one, e);
+    half r = hrcp(sum);
+    half result = __hmul(x, r);
+    return result;
+}
+
+__device__ __forceinline__ half2 _silu(half2 x)
+{
+    half2 one = __float2half2_rn(1.0f);
+    half2 neg_x = __hneg2(x);
+    half2 e = h2exp(neg_x);
+    half2 sum = __hadd2(one, e);
+    half2 r = h2rcp(sum);
+    half2 result = __hmul2(x, r);
+    return result;
+}
+
+__device__ __forceinline__ half _gelu(half x)
+{
+    float xf = __half2float(x);
+    const float c = 0.797884560803f;  // sqrt(2/Pi)
+    float tanh_arg = c * (xf + 0.044715f * xf * xf * xf);
+    xf = 0.5f * xf * (1.0 + tanh_opt(tanh_arg));
+    return __float2half_rn(xf);
+}
+
+__device__ __forceinline__ half2 _gelu(half2 x)
+{
+    return __halves2half2(_gelu(__low2half(x)), _gelu(__high2half(x)));
+}
+
+template <int activation_type>
+__global__ __launch_bounds__(NUM_THREADS)
+void act_mul_kernel
+(
+    const half* __restrict__ x,
+    const half* __restrict__ y,
+    half* __restrict__ z,
+    int numel
+)
+{
+    int idx = (blockIdx.x * NUM_THREADS + threadIdx.x);
+    if (idx >= numel / 2) return;
+
+    half2 x2 = ((const half2*) x)[idx];
+    half2 y2 = ((const half2*) y)[idx];
+
+    if constexpr (activation_type == ACT_SILU)
+        x2 = _silu(x2);
+    else if constexpr (activation_type == ACT_GELU)
+        x2 = _gelu(x2);
+
+    ((half2*) z)[idx] = __hmul2(x2, y2);
+}
+
+// silu(x) * y -> z, in-place if z == x or z == y
+
+void silu_mul
+(
+    const at::Tensor& x,
+    const at::Tensor& y,
+    at::Tensor& z
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(x.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    int numel = x.numel();
+    int blocks = CEIL_DIVIDE(numel, 2 * NUM_THREADS);
+    act_mul_kernel<ACT_SILU><<<blocks, NUM_THREADS, 0, stream>>>
+    (
+        (const half*) x.data_ptr(),
+        (const half*) y.data_ptr(),
+        (half*) z.data_ptr(),
+        numel
+    );
+}
+
+// silu(x) * y -> z, in-place if z == x or z == y
+
+void gelu_mul
+(
+    const at::Tensor& x,
+    const at::Tensor& y,
+    at::Tensor& z
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(x.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    int numel = x.numel();
+    int blocks = CEIL_DIVIDE(numel, 2 * NUM_THREADS);
+    act_mul_kernel<ACT_GELU><<<blocks, NUM_THREADS, 0, stream>>>
+    (
+        (const half*) x.data_ptr(),
+        (const half*) y.data_ptr(),
+        (half*) z.data_ptr(),
+        numel
+    );
+}
--- a/exllamav3/exllamav3_ext/activation.cuh
+++ b/exllamav3/exllamav3_ext/activation.cuh
@@ -0,0 +1,17 @@
+#pragma once
+
+#include <ATen/Tensor.h>
+
+void silu_mul
+(
+    const at::Tensor& x,
+    const at::Tensor& y,
+    at::Tensor& z
+);
+
+void gelu_mul
+(
+    const at::Tensor& x,
+    const at::Tensor& y,
+    at::Tensor& z
+);
--- a/exllamav3/exllamav3_ext/bindings.cpp
+++ b/exllamav3/exllamav3_ext/bindings.cpp
@@ -0,0 +1,59 @@
+#include <torch/extension.h>
+#include <pybind11/pybind11.h>
+#include <pybind11/stl.h>
+
+#include "stloader.h"
+#include "hadamard.h"
+
+#include "norm.cuh"
+#include "hgemm.cuh"
+#include "rope.cuh"
+#include "activation.cuh"
+#include "softcap.cuh"
+
+#include "quant/quantize.cuh"
+#include "quant/pack.cuh"
+#include "quant/reconstruct.cuh"
+#include "quant/hadamard.cuh"
+#include "quant/exl3_gemm.cuh"
+#include "quant/exl3_gemv.cuh"
+
+#include "generator/strings.h"
+#include "generator/sampling_basic.cuh"
+#include "generator/gumbel.cuh"
+
+PYBIND11_MODULE(TORCH_EXTENSION_NAME, m)
+{
+    m.def("stloader_read", &stloader_read, "stloader_read");
+    m.def("stloader_open_file", &stloader_open_file, "stloader_open_file");
+    m.def("stloader_close_file", &stloader_close_file, "stloader_close_file");
+
+    m.def("rms_norm", &rms_norm, "rms_norm");
+    m.def("softcap", &softcap, "softcap");
+
+    m.def("had_paley", &had_paley, "had_paley");
+    m.def("had_paley2", &had_paley2, "had_paley2");
+
+    m.def("quantize_tiles", &quantize_tiles, "quantize_tiles");
+    m.def("decode", &decode, "decode");
+    m.def("pack_trellis", &pack_trellis, "pack_trellis");
+    m.def("unpack_trellis", &unpack_trellis, "unpack_trellis");
+    m.def("pack_signs", &pack_signs, "pack_signs");
+    m.def("reconstruct", &reconstruct, "reconstruct");
+    m.def("had_r_128", &had_r_128, "had_r_128");
+    m.def("exl3_gemm", &exl3_gemm, "exl3_gemm");
+    m.def("exl3_gemm_num_kernel_variants", &exl3_gemm_num_kernel_variants, "exl3_gemm_num_kernel_variants");
+    m.def("hgemm", &hgemm, "hgemm");
+    m.def("rope", &rope, "rope");
+    m.def("silu_mul", &silu_mul, "silu_mul");
+    m.def("gelu_mul", &gelu_mul, "gelu_mul");
+
+    m.def("argmax_sample", &argmax_sample, "argmax_sample");
+    m.def("gumbel_sample", &gumbel_sample, "gumbel_sample");
+    m.def("gumbel_noise_f16", &gumbel_noise_f16, "gumbel_noise_f16");
+    m.def("gumbel_noise_f32", &gumbel_noise_f32, "gumbel_noise_f32");
+    m.def("gumbel_noise_log", &gumbel_noise_log, "gumbel_noise_log");
+
+    m.def("partial_strings_match", &partial_strings_match, "partial_strings_match");
+    m.def("count_match_tensor", &count_match_tensor, "count_match_tensor");
+}
--- a/exllamav3/exllamav3_ext/compat.cuh
+++ b/exllamav3/exllamav3_ext/compat.cuh
@@ -0,0 +1,29 @@
+#pragma once
+
+// Approximate tanh
+
+__forceinline__ __device__ float copysignf_pos(float a, float b)
+{
+    float r;
+    r = __int_as_float(__float_as_int(a) | (__float_as_int(b) & 0x80000000));
+    return r;
+}
+
+#if defined(USE_ROCM) || (defined(__CUDA_ARCH__) && (__CUDA_ARCH__ < 750 || CUDART_VERSION < 11000))
+
+__inline__ __device__ float tanh_opt(float x)
+{
+    const float exp_val = -1.f * fabs(2 * x);
+    return copysignf_pos((1.0f - __expf(exp_val)) / (__expf(exp_val) + 1.0f), x);
+}
+
+#else
+
+__inline__ __device__ float tanh_opt(float x)
+{
+    float r;
+    asm("tanh.approx.f32 %0,%1; \n\t" : "=f"(r) : "f"(x));
+    return r;
+}
+
+#endif
--- a/exllamav3/exllamav3_ext/generator/gumbel.cu
+++ b/exllamav3/exllamav3_ext/generator/gumbel.cu
@@ -0,0 +1,162 @@
+#include "sampling_basic.cuh"
+#include <c10/cuda/CUDAGuard.h>
+#include <ATen/cuda/CUDAContext.h>
+#include <cuda_fp16.h>
+#include "../util.h"
+#include "../util.cuh"
+#include <limits>
+#include <curand_kernel.h>
+
+#define NUM_THREADS 1024
+
+inline __device__ float gumbel(float x)
+{
+    return -__logf(fmaxf(-__logf(fmaxf(x, 1e-20)), 1e-20));
+}
+
+__global__ __launch_bounds__(NUM_THREADS)
+void gumbel_noise_kernel_f16
+(
+    const half* __restrict__ in_logits,
+    half* __restrict__ logits,
+    int size,
+    uint32_t random
+)
+{
+    int idx = (threadIdx.x + NUM_THREADS * blockIdx.x) * 2;
+    if (idx >= size) return;
+
+    curandStatePhilox4_32_10_t state;
+    curand_init(random, idx, 0, &state);
+
+    half2 x01 = *((half2*) (in_logits + idx));
+    float x0 = __half2float(__low2half(x01));
+    float x1 = __half2float(__high2half(x01));
+    float rf0 = curand_uniform(&state);
+    curand_init(random, idx + 1, 0, &state);
+    float rf1 = curand_uniform(&state);
+    x0 += gumbel(rf0);
+    x1 += gumbel(rf1);
+    x01 = __floats2half2_rn(x0, x1);
+    *((half2*) (logits + idx)) = x01;
+}
+
+__global__ __launch_bounds__(NUM_THREADS)
+void gumbel_noise_kernel_f32
+(
+    const float* __restrict__ in_logits,
+    float* __restrict__ logits,
+    int size,
+    uint32_t random
+)
+{
+    int idx = threadIdx.x + NUM_THREADS * blockIdx.x;
+    if (idx >= size) return;
+
+    curandStatePhilox4_32_10_t state;
+    curand_init(random, idx, 0, &state);
+
+    float x = in_logits[idx];
+    float rf = curand_uniform(&state);
+    x += gumbel(rf);
+    logits[idx] = x;
+}
+
+
+__global__ __launch_bounds__(NUM_THREADS)
+void gumbel_noise_kernel_log
+(
+    const float* __restrict__ probs,
+    float* __restrict__ logits,
+    int size,
+    uint32_t random
+)
+{
+    int idx = threadIdx.x + NUM_THREADS * blockIdx.x;
+    if (idx >= size) return;
+
+    curandStatePhilox4_32_10_t state;
+    curand_init(random, idx, 0, &state);
+
+    float x = probs[idx];
+    x = __logf(x);
+    float rf = curand_uniform(&state);
+    x += gumbel(rf);
+    logits[idx] = x;
+}
+
+void gumbel_noise_f16
+(
+    const at::Tensor& logits_in,
+    at::Tensor& logits,
+    uint32_t random
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(logits.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    TORCH_CHECK_DTYPE(logits_in, kHalf);
+    TORCH_CHECK_DTYPE(logits, kHalf);
+
+    int size = logits.numel();
+    int blocks = CEIL_DIVIDE(size / 2, NUM_THREADS);
+
+    gumbel_noise_kernel_f16<<<blocks, NUM_THREADS, 0, stream>>>
+    (
+        (const half*) logits_in.data_ptr(),
+        (half*) logits.data_ptr(),
+        size,
+        random
+    );
+}
+
+void gumbel_noise_f32
+(
+    const at::Tensor& logits_in,
+    at::Tensor& logits,
+    uint32_t random
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(logits.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    TORCH_CHECK_DTYPE(logits_in, kFloat);
+    TORCH_CHECK_DTYPE(logits, kFloat);
+
+    int size = logits.numel();
+    int blocks = CEIL_DIVIDE(size, NUM_THREADS);
+
+    gumbel_noise_kernel_f32<<<blocks, NUM_THREADS, 0, stream>>>
+    (
+        (const float*) logits_in.data_ptr(),
+        (float*) logits.data_ptr(),
+        size,
+        random
+    );
+}
+
+void gumbel_noise_log
+(
+    const at::Tensor& probs,
+    at::Tensor& logits,
+    uint32_t random
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(logits.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    TORCH_CHECK_DTYPE(probs, kFloat);
+    TORCH_CHECK_DTYPE(logits, kFloat);
+    TORCH_CHECK_SHAPES_FULL(probs, logits);
+
+    int size = probs.numel();
+    int blocks = CEIL_DIVIDE(size, NUM_THREADS);
+
+    gumbel_noise_kernel_log<<<blocks, NUM_THREADS, 0, stream>>>
+    (
+        (const float*) probs.data_ptr(),
+        (float*) logits.data_ptr(),
+        size,
+        random
+    );
+}
--- a/exllamav3/exllamav3_ext/generator/gumbel.cuh
+++ b/exllamav3/exllamav3_ext/generator/gumbel.cuh
@@ -0,0 +1,24 @@
+#pragma once
+
+#include <ATen/Tensor.h>
+
+void gumbel_noise_f16
+(
+    const at::Tensor& logits_in,
+    at::Tensor& logits,
+    uint32_t random
+);
+
+void gumbel_noise_f32
+(
+    const at::Tensor& logits_in,
+    at::Tensor& logits,
+    uint32_t random
+);
+
+void gumbel_noise_log
+(
+    const at::Tensor& probs,
+    at::Tensor& logits,
+    uint32_t random
+);
--- a/exllamav3/exllamav3_ext/generator/sampling_basic.cu
+++ b/exllamav3/exllamav3_ext/generator/sampling_basic.cu
@@ -0,0 +1,197 @@
+#include "sampling_basic.cuh"
+#include <c10/cuda/CUDAGuard.h>
+#include <ATen/cuda/CUDAContext.h>
+#include <cuda_fp16.h>
+#include "../util.h"
+#include "../util.cuh"
+#include <limits>
+#include <curand_kernel.h>
+#include "../reduction.cuh"
+
+constexpr float NEG_INF_F32 = -std::numeric_limits<float>::infinity();
+constexpr float POS_INF_F32 = std::numeric_limits<float>::infinity();
+
+#define NUM_THREADS 1024
+
+inline __device__ float gumbel(float x)
+{
+    return -__logf(fmaxf(-__logf(fmaxf(x, 1e-20)), 1e-20));
+}
+
+inline __device__ ValIdx argmax2f(int idx, float& x0, float& x1)
+{
+    ValIdx vi;
+    if (x0 >= x1)
+    {
+        vi.val = x0;
+        vi.idx = idx;
+    }
+    else
+    {
+        vi.val = x1;
+        vi.idx = idx + 1;
+    }
+    vi = block_reduce_argmax(vi);
+    return vi;
+}
+
+inline __device__ bool read2f
+(
+    const half* logits_ptr,
+    int idx,
+    float& x0,
+    float& x1,
+    int num_logits,
+    int max_logit
+)
+{
+    if (idx >= num_logits)
+    {
+        x0 = NEG_INF_F32;
+        x1 = NEG_INF_F32;
+        return false;
+    }
+    else
+    {
+        half2 x0x1 = *((half2*) (logits_ptr + idx));
+        if (idx < max_logit - 1) x0 = __half2float(__low2half(x0x1));
+        else x0 = NEG_INF_F32;
+        if (idx < max_logit) x1 = __half2float(__high2half(x0x1));
+        else x1 = NEG_INF_F32;
+        return true;
+    }
+}
+
+__global__ __launch_bounds__(NUM_THREADS)
+void argmax_sample_kernel
+(
+    const half* __restrict__ logits,
+    uint64_t* __restrict__ ids,
+    int num_logits,
+    int max_logit
+)
+{
+    const half* logits_ptr = logits + num_logits * blockIdx.x;
+    uint64_t* ids_ptr = ids + blockIdx.x;
+
+    ValIdx maxvi = { NEG_INF_F32, 0 };
+    int idx = threadIdx.x * 2;
+    int blocks = CEIL_DIVIDE(max_logit, NUM_THREADS * 2);
+    for (int block = 0; block < blocks; ++block, idx += NUM_THREADS * 2)
+    {
+        float x0, x1;
+        read2f(logits_ptr, idx, x0, x1, num_logits, max_logit);
+        ValIdx vi = argmax2f(idx, x0, x1);
+        if (threadIdx.x == 0 && vi.val > maxvi.val)
+            maxvi = vi;
+    }
+
+    if (threadIdx.x == 0)
+        *ids_ptr = (uint64_t) maxvi.idx;
+}
+
+__global__ __launch_bounds__(NUM_THREADS)
+void gumbel_sample_kernel
+(
+    const half* __restrict__ logits,
+    uint64_t* __restrict__ ids,
+    int num_logits,
+    int max_logit,
+    uint32_t random
+)
+{
+    const half* logits_ptr = logits + num_logits * blockIdx.x;
+    uint64_t* ids_ptr = ids + blockIdx.x;
+
+    curandStatePhilox4_32_10_t state;
+    curand_init(random, threadIdx.x, 0, &state);
+
+    ValIdx maxvi = { NEG_INF_F32, 0 };
+    int idx = threadIdx.x * 2;
+    int blocks = CEIL_DIVIDE(max_logit, NUM_THREADS * 2);
+    for (int block = 0; block < blocks; ++block, idx += NUM_THREADS * 2)
+    {
+        float x0, x1;
+        if (read2f(logits_ptr, idx, x0, x1, num_logits, max_logit))
+        {
+            float rf0 = curand_uniform(&state);
+            float rf1 = curand_uniform(&state);
+            x0 += gumbel(rf0);
+            x1 += gumbel(rf1);
+        }
+        ValIdx vi = argmax2f(idx, x0, x1);
+        if (threadIdx.x == 0 && vi.val > maxvi.val)
+            maxvi = vi;
+    }
+
+    if (threadIdx.x == 0)
+        *ids_ptr = (uint64_t) maxvi.idx;
+}
+
+void common
+(
+    const at::Tensor& logits,
+    at::Tensor& ids,
+    int& bsz,
+    int& num_logits,
+    int& max_logit
+)
+{
+    TORCH_CHECK_DIM(logits, 2);
+    TORCH_CHECK_DIM(ids, 2);
+    TORCH_CHECK_DTYPE(logits, kHalf);
+    TORCH_CHECK_DTYPE(ids, kLong);
+    TORCH_CHECK_SHAPES(logits, 0, ids, 0, 1);
+
+    bsz = logits.size(0);
+    num_logits = logits.size(1);
+    if (max_logit > num_logits) max_logit = num_logits;
+}
+
+void argmax_sample
+(
+    const at::Tensor& logits,
+    at::Tensor& ids,
+    int max_logit
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(logits.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    if (!max_logit) max_logit = logits.size(-1);
+
+    int bsz, num_logits;
+    common(logits, ids, bsz, num_logits, max_logit);
+    argmax_sample_kernel<<<bsz, NUM_THREADS, 0, stream>>>
+    (
+        (const half*) logits.data_ptr(),
+        (uint64_t*) ids.data_ptr(),
+        num_logits,
+        max_logit
+    );
+}
+
+void gumbel_sample
+(
+    const at::Tensor& logits,
+    at::Tensor& ids,
+    int max_logit,
+    uint32_t random
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(logits.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    if (!max_logit) max_logit = logits.size(-1);
+
+    int bsz, num_logits;
+    common(logits, ids, bsz, num_logits, max_logit);
+    gumbel_sample_kernel<<<bsz, NUM_THREADS, 0, stream>>>
+    (
+        (const half*) logits.data_ptr(),
+        (uint64_t*) ids.data_ptr(),
+        num_logits,
+        max_logit,
+        random
+    );
+}
--- a/exllamav3/exllamav3_ext/generator/sampling_basic.cuh
+++ b/exllamav3/exllamav3_ext/generator/sampling_basic.cuh
@@ -0,0 +1,18 @@
+#pragma once
+
+#include <ATen/Tensor.h>
+
+void argmax_sample
+(
+    const at::Tensor& logits,
+    at::Tensor& ids,
+    int max_logit
+);
+
+void gumbel_sample
+(
+    const at::Tensor& logits,
+    at::Tensor& ids,
+    int max_logit,
+    uint32_t random
+);
--- a/exllamav3/exllamav3_ext/generator/strings.cpp
+++ b/exllamav3/exllamav3_ext/generator/strings.cpp
@@ -0,0 +1,72 @@
+#include "strings.h"
+#include "../util.h"
+
+// Compare string Q against list of strings S, utf-32 encoded and packed in byte array.
+//
+// Returns:
+// -1: No matches
+// -2: Partial match; at least one string in S partially overlaps Q on the right-hand side
+// >= 0: Index into Q of full match with at least one string in S
+
+int partial_strings_match
+(
+    py::buffer match,
+    py::buffer offsets,
+    py::buffer strings
+)
+{
+    py::buffer_info info;
+
+    info = match.request();
+    uint32_t* q = static_cast<uint32_t*>(info.ptr);
+    int q_len = info.size / 4;
+
+    info = offsets.request();
+    uint32_t* offsets_int = static_cast<uint32_t*>(info.ptr);
+    int num_strings = info.size / 4 - 1;
+
+    info = strings.request();
+    uint32_t* strings_utf32 = static_cast<uint32_t*>(info.ptr);
+
+    for (int i = 0; i < num_strings; ++i)
+    {
+        int beg = offsets_int[i] / 4;
+        int s_len = offsets_int[i + 1] / 4 - beg;
+        uint32_t* s = strings_utf32 + beg;
+
+        int a = 0;
+        int b = 0;
+        while (a < q_len)
+        {
+            int a0 = a;
+            while (q[a++] == s[b++])
+            {
+                if (b == s_len) return a0;
+                if (a == q_len) return -2;
+            }
+            a = a0 + 1;
+            b = 0;
+       }
+    }
+
+    return -1;
+}
+
+int count_match_tensor
+(
+    at::Tensor a,
+    at::Tensor b,
+    int max_a
+)
+{
+    uint64_t* pa = (uint64_t*) a.data_ptr();
+    uint64_t* pb = (uint64_t*) b.data_ptr();
+    int max_b = b.size(1);
+    if (max_b < max_a) max_a = max_b;
+
+    int match = 0;
+    while (match < max_a && *pa++ == *pb++)
+        match++;
+
+    return match;
+}
--- a/exllamav3/exllamav3_ext/generator/strings.h
+++ b/exllamav3/exllamav3_ext/generator/strings.h
@@ -0,0 +1,24 @@
+#pragma once
+
+#include <vector>
+#include <string>
+#include <pybind11/pybind11.h>
+#include <pybind11/pytypes.h>
+
+#include <ATen/Tensor.h>
+
+namespace py = pybind11;
+
+int partial_strings_match
+(
+    py::buffer match,
+    py::buffer offsets,
+    py::buffer strings
+);
+
+int count_match_tensor
+(
+    at::Tensor a,
+    at::Tensor b,
+    int max_a
+);
--- a/exllamav3/exllamav3_ext/hadamard.cpp
+++ b/exllamav3/exllamav3_ext/hadamard.cpp
@@ -0,0 +1,112 @@
+#include "hadamard.h"
+#include "util.h"
+
+#define HALF_P 0x3C00
+#define HALF_N 0xBC00
+#define HALF_PP 0x3C003C00
+#define HALF_PN 0xBC003C00
+#define HALF_NP 0x3C00BC00
+#define HALF_NN 0xBC00BC00
+
+inline int pmod(int a, int b)
+{
+    int ret = a % b;
+    if (ret < 0 && b > 0) ret += b;
+    return ret;
+}
+
+inline int modular_pow(int base, int exp, int mod)
+{
+    int result = 1;
+    base = pmod(base, mod);
+    while (exp > 0)
+    {
+        if (exp % 2 == 1) result = pmod((result * base), mod);
+        exp = exp >> 1;
+        base = pmod((base * base), mod);
+    }
+    return result;
+}
+
+inline bool is_quadratic_residue(int a, int p)
+{
+    return modular_pow(a, (p - 1) / 2, p) == 1;
+}
+
+// Paley construction
+
+void had_paley
+(
+    at::Tensor h
+)
+{
+    TORCH_CHECK_DTYPE(h, kHalf);
+    TORCH_CHECK_SHAPES(h, 0, h, 1, 1);
+    TORCH_CHECK(h.is_contiguous());
+    int n = h.size(0);
+    int p = n - 1;
+    uint16_t* ptr = (uint16_t*) h.data_ptr();
+
+    for (int j = 0; j < n; ++j)
+        *ptr++ = HALF_P;
+
+    for (int i = 0; i < p; ++i)
+    {
+        *ptr++ = HALF_N;
+        for (int j = 0; j < p; ++j)
+        {
+            if (i == j) *ptr++ = HALF_P;
+            else
+            {
+                int residue = pmod(i - j, p);
+                if (is_quadratic_residue(residue, p))
+                    *ptr++ = HALF_P;
+                else
+                    *ptr++ = HALF_N;
+            }
+        }
+    }
+}
+
+// Paley construction, type 2
+
+void had_paley2
+(
+    at::Tensor h
+)
+{
+    TORCH_CHECK_DTYPE(h, kHalf);
+    TORCH_CHECK_SHAPES(h, 0, h, 1, 1);
+    int n = h.size(0);
+    int p = n / 2 - 1;
+    uint32_t* ptr0 = (uint32_t*) h.data_ptr();
+    uint32_t* ptr1 = ptr0 + n / 2;
+
+    for (int i = 0; i < n / 2; ++i)
+    {
+        for (int j = 0; j < n / 2; ++j)
+        {
+            if (i == j)
+            {
+                *ptr0++ = HALF_PN;
+                *ptr1++ = HALF_NN;
+            }
+            else
+            {
+                int residue = pmod(i - j, p);
+                if (i == 0 || j == 0 || is_quadratic_residue(residue, p))
+                {
+                    *ptr0++ = HALF_PP;
+                    *ptr1++ = HALF_PN;
+                }
+                else
+                {
+                    *ptr0++ = HALF_NN;
+                    *ptr1++ = HALF_NP;
+                }
+            }
+        }
+        ptr0 += n / 2;
+        ptr1 += n / 2;
+    }
+}
--- a/exllamav3/exllamav3_ext/hadamard.h
+++ b/exllamav3/exllamav3_ext/hadamard.h
@@ -0,0 +1,13 @@
+#pragma once
+
+#include <ATen/Tensor.h>
+
+void had_paley
+(
+    at::Tensor h
+);
+
+void had_paley2
+(
+    at::Tensor h
+);
--- a/exllamav3/exllamav3_ext/hgemm.cu
+++ b/exllamav3/exllamav3_ext/hgemm.cu
@@ -0,0 +1,57 @@
+#include "hgemm.cuh"
+#include <c10/cuda/CUDAGuard.h>
+#include <ATen/cuda/CUDAContext.h>
+#include <cuda_fp16.h>
+#include "util.h"
+#include "util.cuh"
+
+/*
+
+Row-major float16 matmul using cuBLAS, a @ b -> c
+
+*/
+
+void hgemm
+(
+    at::Tensor a,
+    at::Tensor b,
+    at::Tensor c
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(a.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    TORCH_CHECK_DTYPE(a, kHalf);
+    TORCH_CHECK_DTYPE(b, kHalf);
+    TORCH_CHECK_DTYPE(c, kHalf);
+    TORCH_CHECK_DIM(a, 2);
+    TORCH_CHECK_DIM(b, 2);
+    TORCH_CHECK_DIM(c, 2);
+    TORCH_CHECK_SHAPES(a, 0, c, 0, 1);
+    TORCH_CHECK_SHAPES(a, 1, b, 0, 1);
+    TORCH_CHECK_SHAPES(b, 1, c, 1, 1);
+
+    const half* a_ptr = (const half*) a.data_ptr();
+    const half* b_ptr = (const half*) b.data_ptr();
+    half* c_ptr = (half*) c.data_ptr();
+
+    int size_m = a.size(0);
+    int size_k = a.size(1);
+    int size_n = b.size(1);
+
+    cublasHandle_t cublas_handle = at::cuda::getCurrentCUDABlasHandle();
+
+    half alpha_ = __float2half(1.0f);
+    half beta_ = __float2half(0.0f);
+    cublasSetStream(cublas_handle, stream);
+    cublasHgemm
+    (
+        cublas_handle,
+        CUBLAS_OP_N,
+        CUBLAS_OP_N,
+        size_n, size_m, size_k,
+        &alpha_, b_ptr, size_n,
+                 a_ptr, size_k,
+        &beta_,  c_ptr, size_n
+    );
+}
--- a/exllamav3/exllamav3_ext/hgemm.cuh
+++ b/exllamav3/exllamav3_ext/hgemm.cuh
@@ -0,0 +1,10 @@
+#pragma once
+
+#include <ATen/Tensor.h>
+
+void hgemm
+(
+    at::Tensor a,
+    at::Tensor b,
+    at::Tensor c
+);
--- a/exllamav3/exllamav3_ext/norm.cu
+++ b/exllamav3/exllamav3_ext/norm.cu
@@ -0,0 +1,236 @@
+#include "norm.cuh"
+#include <c10/cuda/CUDAGuard.h>
+#include <ATen/cuda/CUDAContext.h>
+#include <cuda_fp16.h>
+#include "util.h"
+#include "util.cuh"
+
+#define NUM_THREADS 1024
+
+#if defined(USE_ROCM)
+#define NUM_WARPS (1024 / warpSize)
+#define WARP_SIZE (warpSize)
+#else
+#define NUM_WARPS 32
+#define WARP_SIZE 32
+#endif
+
+__device__ inline float reduce(float sum, int warp_id, int lane_id)
+{
+    // Shuffle to sum across lanes
+    __shared__ float sums[NUM_WARPS];
+    for(int offset = warpSize / 2; offset > 0; offset /= 2) sum += __shfl_xor_sync(0xffffffff, sum, offset);
+    if (lane_id == 0) sums[warp_id] = sum;
+    __syncthreads();
+
+    // Load partial sums from across warps, shuffle again across lanes
+    #if defined(USE_ROCM)
+        sum = lane_id < NUM_WARPS ? sums[lane_id] : 0.0f;
+    #else
+        sum = sums[lane_id];
+    #endif
+    for(int offset = warpSize / 2; offset > 0; offset /= 2) sum += __shfl_xor_sync(0xffffffff, sum, offset);
+
+    return sum;
+}
+
+__device__ inline void read_half4(float4& f4, const half4* addr, bool clamp)
+{
+    half4 h4;
+    READ64(h4, addr);
+    f4.x = LOW_TO_FLOAT(h4.x);
+    f4.y = HIGH_TO_FLOAT(h4.x);
+    f4.z = LOW_TO_FLOAT(h4.y);
+    f4.w = HIGH_TO_FLOAT(h4.y);
+    if (clamp)
+    {
+        f4.x = CLAMP_FP16(f4.x);
+        f4.y = CLAMP_FP16(f4.y);
+        f4.z = CLAMP_FP16(f4.z);
+        f4.w = CLAMP_FP16(f4.w);
+    }
+}
+
+__device__ inline void read_float4(float4& f4, const float4* addr)
+{
+    READ128(f4, addr);
+}
+
+__device__ inline void write_half4(const float4& f4, half4* addr)
+{
+    half4 h4
+    (
+        __halves2half2(__float2half_rn(f4.x), __float2half_rn(f4.y)),
+        __halves2half2(__float2half_rn(f4.z), __float2half_rn(f4.w))
+    );
+    WRITE64(addr, h4);
+}
+
+__device__ inline void write_float4(const float4& f4, float4* addr)
+{
+    WRITE128(addr, f4);
+}
+
+__device__ inline float sum_sq4(float lsum, const float4& f4)
+{
+    lsum = fma(f4.x, f4.x, lsum);
+    lsum = fma(f4.y, f4.y, lsum);
+    lsum = fma(f4.z, f4.z, lsum);
+    lsum = fma(f4.w, f4.w, lsum);
+    return lsum;
+}
+
+__device__ inline void apply4(float4& x4, const float4& w4, const float rmf)
+{
+    x4.x = x4.x * w4.x * rmf;
+    x4.y = x4.y * w4.y * rmf;
+    x4.z = x4.z * w4.z * rmf;
+    x4.w = x4.w * w4.w * rmf;
+}
+
+template <typename input_t, typename output_t>
+__global__ __launch_bounds__(NUM_THREADS)
+void rms_norm_kernel
+(
+    const input_t* __restrict__ x,
+    const half* __restrict__ w,
+    output_t* __restrict__ y,
+    const float epsilon,
+    const int rows,
+    const int dim,
+    float constant_bias
+)
+{
+    constexpr bool input_fp32 = std::is_same_v<input_t, float>;
+    constexpr bool output_fp32 = std::is_same_v<output_t, float>;
+    constexpr bool input_fp16 = std::is_same_v<input_t, half>;
+    constexpr bool output_fp16 = std::is_same_v<output_t, half>;
+    static_assert(input_fp32 || input_fp16, "rms_norm_kernel: input must be float or half type");
+    static_assert(output_fp32 || output_fp16, "rms_norm_kernel: output must be float or half type");
+
+    int t = threadIdx.x;
+    int warp_id = threadIdx.x / WARP_SIZE;
+    int lane_id = threadIdx.x % WARP_SIZE;
+    int row = blockIdx.x;
+
+    int columns = dim / 4;
+
+    // Compute sum of squares
+    float sum = 0.0f;
+    for (int column = t; column < columns; column += NUM_THREADS)
+    {
+        float4 x4;
+        if constexpr (input_fp16) read_half4(x4, ((const half4*) (x + row * dim)) + column, true);
+        if constexpr (input_fp32) read_float4(x4, ((const float4*) (x + row * dim)) + column);
+        sum = sum_sq4(sum, x4);
+    }
+    sum = reduce(sum, warp_id, lane_id);
+
+    // Get norm
+    float rmf = rsqrtf(sum / (float)dim + epsilon);
+
+    // Normalize x, scaling by w
+    for (int column = t; column < columns; column += NUM_THREADS)
+    {
+        float4 x4;
+        if constexpr (input_fp16) read_half4(x4, ((const half4*) (x + row * dim)) + column, true);
+        if constexpr (input_fp32) read_float4(x4, ((const float4*) (x + row * dim)) + column);
+
+        float4 w4;
+        read_half4(w4, ((const half4*) w) + column, false);
+        if (constant_bias != 0.0f)
+        {
+            w4.x += constant_bias;
+            w4.y += constant_bias;
+            w4.z += constant_bias;
+            w4.w += constant_bias;
+        }
+
+        apply4(x4, w4, rmf);
+
+        if constexpr (output_fp16) write_half4(x4, ((half4*) (y + row * dim)) + column);
+        if constexpr (output_fp32) write_float4(x4, ((float4*) (y + row * dim)) + column);
+    }
+}
+
+/*
+Compute RMSNorm: y = x * w / sqrt(row_mean(x * x) + epsilon)
+- Can operate in-place if y == x
+- x can be either float or half dtype
+- y can be either float or half dtype
+- w must be half dtype
+*/
+void rms_norm
+(
+    at::Tensor x,
+    at::Tensor w,
+    at::Tensor y,
+    float epsilon,
+    float constant_bias
+)
+{
+    TORCH_CHECK_DTYPE(w, kHalf);
+    TORCH_CHECK_DIV(x, 1, 4);
+    TORCH_CHECK_SHAPES(x, 1, w, 0, 1);
+    TORCH_CHECK_SHAPES(x, 0, y, 0, 1);
+    TORCH_CHECK_SHAPES(x, 1, y, 1, 1);
+
+    const at::cuda::OptionalCUDAGuard device_guard(x.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    bool input_fp32 = x.dtype() == at::kFloat;
+    bool output_fp32 = y.dtype() == at::kFloat;
+    bool input_fp16 = !input_fp32;
+    bool output_fp16 = !output_fp32;
+    int rows = x.size(0);
+    int dim = x.size(1);
+    dim3 blockDim(NUM_THREADS, 1, 1);
+    dim3 gridDim(rows, 1, 1);
+
+    if (input_fp16 && output_fp16)
+        rms_norm_kernel<<<gridDim, blockDim, 0, stream>>>
+        (
+            (const half*) x.data_ptr(),
+            (const half*) w.data_ptr(),
+            (half*) y.data_ptr(),
+            epsilon,
+            rows,
+            dim,
+            constant_bias
+        );
+    else if (input_fp16 && output_fp32)
+        rms_norm_kernel<<<gridDim, blockDim, 0, stream>>>
+        (
+            (const half*) x.data_ptr(),
+            (const half*) w.data_ptr(),
+            (float*) y.data_ptr(),
+            epsilon,
+            rows,
+            dim,
+            constant_bias
+        );
+    else if (input_fp32 && output_fp16)
+        rms_norm_kernel<<<gridDim, blockDim, 0, stream>>>
+        (
+            (const float*) x.data_ptr(),
+            (const half*) w.data_ptr(),
+            (half*) y.data_ptr(),
+            epsilon,
+            rows,
+            dim,
+            constant_bias
+        );
+    else if (input_fp32 && output_fp32)
+        rms_norm_kernel<<<gridDim, blockDim, 0, stream>>>
+        (
+            (const float*) x.data_ptr(),
+            (const half*) w.data_ptr(),
+            (float*) y.data_ptr(),
+            epsilon,
+            rows,
+            dim,
+            constant_bias
+        );
+    else
+        TORCH_CHECK(false, "rms_norm: Invalid datatypes for input/output, must be half or float")
+}
--- a/exllamav3/exllamav3_ext/norm.cuh
+++ b/exllamav3/exllamav3_ext/norm.cuh
@@ -0,0 +1,12 @@
+#pragma once
+
+#include <ATen/Tensor.h>
+
+void rms_norm
+(
+    at::Tensor x,
+    at::Tensor w,
+    at::Tensor y,
+    float epsilon,
+    float constant_bias
+);
--- a/exllamav3/exllamav3_ext/ptx.cuh
+++ b/exllamav3/exllamav3_ext/ptx.cuh
@@ -0,0 +1,246 @@
+#pragma once
+
+// Tensor core fragments
+
+template <typename T, int n>
+struct Vec
+{
+    T elems[n];
+    __device__ T& operator[](int i) { return elems[i]; }
+};
+
+using FragA = Vec<half2, 4>;
+using FragB = Vec<half2, 2>;
+using FragC = Vec<float, 4>;
+using FragC_h = Vec<half2, 2>;
+
+// m8n8k4 tensor core matmul (emulated on Ampere and later), don't use
+//
+// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#matrix-fragments-for-mma-m8n8k4-with-f16-floating-point-type
+
+__device__ inline void ptx_mma_m8n8k4
+(
+    const Vec<half2, 2>& frag_a,
+    const Vec<half2, 2>& frag_b,
+    Vec<float, 8>& frag_c
+)
+{
+    const uint32_t* a = reinterpret_cast<const uint32_t*>(&frag_a);
+    const uint32_t* b = reinterpret_cast<const uint32_t*>(&frag_b);
+    float* c = reinterpret_cast<float*>(&frag_c);
+    const float* d = reinterpret_cast<const float*>(&frag_c);
+
+    asm volatile
+    (
+        "mma.sync.aligned.m8n8k4.row.col.f32.f16.f16.f32 "
+        "{%0,%1,%2,%3,%4,%5,%6,%7}, {%8,%9}, {%10,%11}, {%12,%13,%14,%15,%16,%17,%18,%19};\n"
+
+        : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3]),"=f"(c[4]), "=f"(c[5]), "=f"(c[6]), "=f"(c[7])
+
+        :  "r"(a[0]), "r"(a[1]),
+           "r"(b[0]), "r"(b[1]),
+           "f"(d[0]), "f"(d[1]), "f"(d[2]), "f"(d[3]), "f"(d[4]), "f"(d[5]), "f"(d[6]), "f"(d[7])
+    );
+}
+
+// m16n8k16 tensor core matmul
+//
+// https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#matrix-fragments-for-mma-m16n8k16-with-floating-point-type
+
+// FP16 @ FP16 + FP32 -> FP32
+__device__ inline void ptx_mma_m16n8k16
+(
+    const FragA& frag_a,
+    const FragB& frag_b,
+    FragC& frag_c
+)
+{
+    const uint32_t* a = reinterpret_cast<const uint32_t*>(&frag_a);
+    const uint32_t* b = reinterpret_cast<const uint32_t*>(&frag_b);
+    float* c = reinterpret_cast<float*>(&frag_c);
+    const float* d = reinterpret_cast<const float*>(&frag_c);
+
+    asm volatile
+    (
+        "mma.sync.aligned.m16n8k16.row.col.f32.f16.f16.f32 "
+        "{%0,%1,%2,%3}, {%4,%5,%6,%7}, {%8,%9}, {%10,%11,%12,%13};\n"
+
+        : "=f"(c[0]), "=f"(c[1]), "=f"(c[2]), "=f"(c[3])
+        :  "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
+           "r"(b[0]), "r"(b[1]),
+           "f"(d[0]), "f"(d[1]), "f"(d[2]), "f"(d[3])
+    );
+}
+
+// FP16 @ FP16 + FP16 -> FP16
+__device__ inline void ptx_mma_m16n8k16
+(
+    const FragA& frag_a,
+    const FragB& frag_b,
+    FragC_h& frag_c
+)
+{
+    const uint32_t* a = reinterpret_cast<const uint32_t*>(&frag_a);
+    const uint32_t* b = reinterpret_cast<const uint32_t*>(&frag_b);
+    uint32_t* c = reinterpret_cast<uint32_t*>(&frag_c);
+    const uint32_t* d = reinterpret_cast<const uint32_t*>(&frag_c);
+
+    asm volatile
+    (
+        "mma.sync.aligned.m16n8k16.row.col.f16.f16.f16.f16 "
+        "{%0,%1}, {%2,%3,%4,%5}, {%6,%7}, {%8,%9};\n"
+
+        : "=r"(c[0]), "=r"(c[1])
+        :  "r"(a[0]), "r"(a[1]), "r"(a[2]), "r"(a[3]),
+           "r"(b[0]), "r"(b[1]),
+           "r"(d[0]), "r"(d[1])
+    );
+}
+
+// Global barrier
+
+__device__ inline void barrier_acquire
+(
+    int* lock,
+    int stage
+)
+{
+    if (threadIdx.x == 0)
+    {
+        volatile int state = -1;
+        do
+        {
+            asm volatile ("ld.global.acquire.gpu.b32 %0, [%1];\n" : "=r"(state) : "l"(lock));
+        }
+        while (state != stage);
+    }
+    __syncthreads();
+}
+
+__device__ inline void barrier_release
+(
+    int* lock,
+    int val,
+    bool reset
+)
+{
+    __syncthreads();
+    if (threadIdx.x == 0)
+    {
+        if (reset)
+        {
+            *lock = 0;
+            return;
+        }
+        asm volatile ("fence.acq_rel.gpu;\n");
+        asm volatile ("red.relaxed.gpu.global.add.s32 [%0], %1;\n" : : "l"(lock), "r"(val));
+    }
+}
+
+// Load global to shared memory, predicated. Seems to produce incorrect code when compiling for Blackwell, but
+// `if (...) cp_async(...)` compiles to a predicated instruction anyway
+
+__device__ inline void cp_async_pred(void* smem_ptr, const void* glob_ptr, bool pred = true)
+{
+    const int bytes = 16;
+    uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
+    asm volatile(
+        "{\n"
+        "   .reg .pred p;\n"
+        "   setp.ne.b32 p, %0, 0;\n"
+        "   @p cp.async.cg.shared.global [%1], [%2], %3;\n"
+        "}\n" :: "r"((int) pred), "r"(smem), "l"(glob_ptr), "n"(bytes)
+    );
+}
+
+// Load global to shared memory
+
+__device__ inline void cp_async(void* smem_ptr, const void* glob_ptr)
+{
+    const int bytes = 16;
+    uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
+    asm volatile(
+        "{\n"
+        "   cp.async.cg.shared.global [%0], [%1], %2;\n"
+        "}\n" :: "r"(smem), "l"(glob_ptr), "n"(bytes)
+    );
+}
+
+// Load global to shared memory with cache hint to evict data from L2 ASAP
+
+__device__ inline void cp_async_stream(void* smem_ptr, const void* glob_ptr)
+{
+    uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
+    const int bytes = 16;
+    asm volatile
+    (
+        "{\n"
+        "   .reg .b64 p;\n"
+        "   createpolicy.fractional.L2::evict_first.b64 p, 1.0;\n"
+        "   cp.async.cg.shared.global.L2::cache_hint [%0], [%1], %2, p;\n"
+        "}\n" :: "r"(smem), "l"(glob_ptr), "n"(bytes)
+    );
+}
+
+// Async copy fence, commit all pending async copies
+
+__device__ inline void cp_async_fence()
+{
+    asm volatile("cp.async.commit_group;\n" ::);
+}
+
+// Wait until at most n async groups are still pending.
+
+template <int n>
+__device__ inline void cp_async_wait()
+{
+    asm volatile("cp.async.wait_group %0;\n" :: "n"(n));
+}
+
+// Load 16x16 matrix fragment from shared memory, directly in tensor core layout
+
+__device__ inline void ldsm4(FragA& frag_a, const void* smem_ptr)
+{
+    uint32_t* a = reinterpret_cast<uint32_t*>(&frag_a);
+    uint32_t smem = static_cast<uint32_t>(__cvta_generic_to_shared(smem_ptr));
+    asm volatile
+    (
+        "ldmatrix.sync.aligned.m8n8.x4.shared.b16 {%0,%1,%2,%3}, [%4];\n"
+        : "=r"(a[0]), "=r"(a[1]), "=r"(a[2]), "=r"(a[3]) : "r"(smem)
+    );
+}
+
+__device__ inline uint32_t mul_lo_u32(uint32_t x, uint32_t y)
+{
+    uint32_t w;
+    asm volatile
+    (
+        "mul.lo.u32 %0, %1, %2;"
+        : "=r"(w)
+        :  "r"(x), "r"(y)
+    );
+    return w;
+}
+
+__device__ inline uint32_t mul_hi_u32(uint32_t x, uint32_t y)
+{
+    uint32_t w;
+    asm volatile
+    (
+        "mul.hi.u32 %0, %1, %2;"
+        : "=r"(w)
+        :  "r"(x), "r"(y)
+    );
+    return w;
+}
+
+static __forceinline__ __device__ uint32_t bfe64(uint32_t lo, uint32_t hi, int offset, int length)
+{
+    uint64_t value = (static_cast<uint64_t>(hi) << 32) | static_cast<uint64_t>(lo);
+    uint64_t result64;
+    asm volatile ("bfe.u64 %0, %1, %2, %3;"
+                  : "=l"(result64)
+                  : "l"(value), "r"(offset), "r"(length));
+
+    return static_cast<uint32_t>(result64);
+}
--- a/exllamav3/exllamav3_ext/quant/codebook.cuh
+++ b/exllamav3/exllamav3_ext/quant/codebook.cuh
@@ -0,0 +1,89 @@
+#pragma once
+
+// "3INST" procedural codebook
+
+__device__ inline half decode_3inst(uint32_t x)
+{
+    x *= 89226354u;
+    x += 64248484u;
+    x &= 0b10001111111111111000111111111111u;
+    x ^= 0b00111011011000000011101101100000u;
+    half2_uint32 xu(x);
+    return __hadd(__low2half(xu.as_half2), __high2half(xu.as_half2));
+}
+
+__device__ inline half2 decode_3inst_2(uint32_t x0, uint32_t x1)
+{
+    x0 *= 89226354u;
+    x1 *= 89226354u;
+    x0 += 64248484u;
+    x1 += 64248484u;
+    x0 &= 0b10001111111111111000111111111111u;
+    x1 &= 0b10001111111111111000111111111111u;
+    x0 ^= 0b00111011011000000011101101100000u;
+    x1 ^= 0b00111011011000000011101101100000u;
+    half2_uint32 xu0(x0);
+    half2_uint32 xu1(x1);
+    half2 d0 = __halves2half2(__low2half(xu0.as_half2), __low2half(xu1.as_half2));
+    half2 d1 = __halves2half2(__high2half(xu0.as_half2), __high2half(xu1.as_half2));
+    return __hadd2(d0, d1);
+}
+
+__device__ inline float decode_3inst_f(uint64_t x)
+{
+    return __half2float(decode_3inst(x));
+}
+
+__device__ inline float decode_3inst_f_diff(uint64_t x, float d)
+{
+    return __half2float(decode_3inst(x)) - d;
+}
+
+// "2MAD" procedural codebook, much more overhead than 3INST, slightly better distribution at 2bpw
+
+__device__ inline half decode_2mad(uint64_t x)
+{
+    x = x * 264435761u + 1013904223u;
+    x = ((x * 1664525u) >> 32) + x;
+    int32_t c = (int32_t) __dp4a((uint32_t) x, 0x01010101u, 0xFFFFFE02u);
+    half y = __hmul(__int2half_rn(c), __float2half_rn(0.008415));
+    return y;
+}
+
+__device__ inline float decode_2mad_f(uint64_t x)
+{
+    x = x * 264435761u + 1013904223u;
+    x = ((x * 1664525u) >> 32) + x;
+    int32_t c = (int32_t) __dp4a((uint32_t) x, 0x01010101u, 0xFFFFFE02u);
+    float y = __int2float_rn(c) * 0.008415f;
+    return y;
+}
+
+__device__ inline float decode_2mad_f_diff(uint64_t x, float d)
+{
+    x = x * 264435761u + 1013904223u;
+    x = ((x * 1664525u) >> 32) + x;
+    int32_t c = (int32_t) __dp4a((uint32_t) x, 0x01010101u, 0xFFFFFE02u);
+    float y = fma(__int2float_rn(c), 0.008415f, -d);
+    return y;
+}
+
+//
+
+__device__ inline half decode_pcb(uint64_t x)
+{
+//    return decode_2mad(x);
+    return decode_3inst(x);
+}
+
+__device__ inline float decode_pcb_f(uint64_t x)
+{
+//    return decode_2mad_f(x);
+    return decode_3inst_f(x);
+}
+
+__device__ inline float decode_pcb_f_diff(uint64_t x, float d)
+{
+//    return decode_2mad_f_diff(x, d);
+    return decode_3inst_f_diff(x, d);
+}
--- a/exllamav3/exllamav3_ext/quant/exl3_dq.cuh
+++ b/exllamav3/exllamav3_ext/quant/exl3_dq.cuh
@@ -0,0 +1,255 @@
+#pragma once
+
+#include "codebook.cuh"
+
+__device__ __forceinline__ uint32_t fshift(uint32_t b, uint32_t a, int shift)
+{
+    // uint64_t merged = ((uint64_t)b << 32) | (uint64_t) a;
+    // return (uint32_t)(merged >> shift);
+
+    // Conditional funnel shift is somehow faster
+    if (shift < 32) return __funnelshift_r(b, a, shift);
+    return a >> (shift - 32);
+}
+
+template <int bits>
+__device__ __forceinline__ half dq(const uint32_t* ptr, int t_offset)
+{
+    int b0 = t_offset * bits + bits - 16 + 256 * bits;  // bit index, start of word0
+    int b1 = b0 + 16;                                   // bit index, end of word0
+    int i0 = b0 / 32;                                   // uint32 containing first bit of word0
+    int i1 = (b1 - 1) / 32;                             // uint32 containing last bit of word0, may be == i0
+    int s0 = (i1 + 1) * 32 - b1;                        // shift value to align word1 to 32-bit boundary
+
+    // Load 32 or 64 bits containing word0
+    uint32_t a = ptr[i0 % (bits * 256 / 32)];
+    uint32_t b = ptr[i1 % (bits * 256 / 32)];
+
+    // Shift into place
+    uint32_t w0 = __funnelshift_r(b, a, s0) & 0xffff;
+    return decode_3inst(w0);
+}
+
+template <int bits>
+__device__ __forceinline__ half2 dq2(const uint32_t* ptr, int t_offset)
+{
+    int b0 = t_offset * bits + bits - 16 + 256 * bits;  // bit index, start of word0
+    int b1 = b0 + 16;                                   // bit index, end of word0
+    int i0 = b0 / 32;                                   // uint32 containing first bit of word0
+    int i1 = (b1 - 1) / 32;                             // uint32 containing last bit of word0, may be == i0
+    int s0 = (i1 + 1) * 32 - b1;                        // shift value to align word1 to 32-bit boundary
+
+    // Load 32 or 64 bits containing word0
+    uint32_t a = ptr[i0 % (bits * 256 / 32)];
+    uint32_t b = ptr[i1 % (bits * 256 / 32)];
+
+    // Shift into place
+    uint32_t w1 = __funnelshift_r(b, a, s0)        & 0xffff;
+    uint32_t w0 = __funnelshift_r(b, a, s0 + bits) & 0xffff;
+    return decode_3inst_2(w0, w1);
+}
+
+template <int bits>
+__device__ __forceinline__ void dq4(const uint32_t* ptr, int t_offset, FragB& frag)
+{
+    int b0 = (t_offset + 257) * bits - 16;      // start of first word
+    int b1 = b0 + 3 * bits;                     // start of last word
+    int b2 = b1 + 16;                           // end of last word
+    int i0 = b0 / 32;                           // uint32 containing first bit of first word
+    int i2 = (b2 - 1) / 32;                     // uint32 containing last bit of last word, may be == i0
+    int s2 = (i2 + 1) * 32 - b2;                // shift value to align last word to 32-bit boundary
+
+    uint32_t a = ptr[i0 % (bits * 256 / 32)];
+    uint32_t b = ptr[i2 % (bits * 256 / 32)];
+    uint32_t w3 = fshift(b, a, s2)            & 0xffff;
+    uint32_t w2 = fshift(b, a, s2 + bits)     & 0xffff;
+    uint32_t w1 = fshift(b, a, s2 + bits * 2) & 0xffff;
+    uint32_t w0 = fshift(b, a, s2 + bits * 3) & 0xffff;
+    half2 d0d1 = decode_3inst_2(w0, w1);
+    half2 d2d3 = decode_3inst_2(w2, w3);
+    frag[0] = d0d1;
+    frag[1] = d2d3;
+}
+
+template <int bits>
+__device__ __forceinline__ void dq2x2(const uint32_t* ptr, int t_offset, FragB& frag)
+{
+    #pragma unroll
+    for (int i = 0; i < 2; ++i)
+    {
+        int b0 = (t_offset + 2 * i + 257) * bits - 16;  // start of first word
+        int b1 = b0 + 1 * bits;                         // start of last word
+        int b2 = b1 + 16;                               // end of last word
+        int i0 = b0 / 32;                               // uint32 containing first bit of first word
+        int i2 = (b2 - 1) / 32;                         // uint32 containing last bit of last word, may be == i0
+        int s2 = (i2 + 1) * 32 - b2;                    // shift value to align last word to 32-bit boundary
+
+        uint32_t a = ptr[i0 % (bits * 256 / 32)];
+        uint32_t b = ptr[i2 % (bits * 256 / 32)];
+        uint32_t w1 = fshift(b, a, s2)        & 0xffff;
+        uint32_t w0 = fshift(b, a, s2 + bits) & 0xffff;
+        half2 d0d1 = decode_3inst_2(w0, w1);
+        frag[i] = d0d1;
+    }
+}
+
+template <int bits, int align>
+__device__ __forceinline__ void dq8(const uint32_t* ptr, int t_offset, FragB& frag0, FragB& frag1)
+{
+    int b1 = (t_offset + 257) * bits;               // end of first word
+    int b0 = b1 - 16;                               // start of first word
+    int b2 = b1 + bits * 7;
+    int i0 = b0 / 32;                               // uint32 containing first bit of word0
+    int i2 = (b2 - 1) / 32;                         // uint32 containing last bit of word0, may be == i0
+    int s2 = (i2 + 1) * 32 - b2;                    // shift value to align last word to 32-bit boundary
+
+    uint32_t a = ptr[i0 % (bits * 256 / 32)];
+    uint32_t b = ptr[i2 % (bits * 256 / 32)];
+    uint32_t w0, w1, w2, w3, w4, w5, w6, w7;
+    if constexpr (align == 1)
+    {
+        w7 = fshift(b, a, s2);
+        w6 = fshift(b, a, s2 + bits);
+        w5 = fshift(b, a, s2 + bits * 2);
+        w4 = fshift(b, a, s2 + bits * 3);
+        w3 = fshift(b, a, s2 + bits * 4);
+        w2 = fshift(b, a, s2 + bits * 5);
+        w1 = fshift(b, a, s2 + bits * 6);
+        w0 = fshift(b, a, s2 + bits * 7);
+    }
+    if constexpr (align == 2)
+    {
+        w7 = fshift(b, a, s2);
+        w6 = w7 >> bits;
+        w5 = fshift(b, a, s2 + bits * 2);
+        w4 = w5 >> bits;
+        w3 = fshift(b, a, s2 + bits * 4);
+        w2 = w3 >> bits;
+        w1 = fshift(b, a, s2 + bits * 6);
+        w0 = w1 >> bits;
+    }
+    if constexpr (align == 4)
+    {
+        w7 = fshift(b, a, s2);
+        w6 = w7 >> bits;
+        w5 = w6 >> bits;
+        w4 = w5 >> bits;
+        w3 = fshift(b, a, s2 + bits * 4);
+        w2 = w3 >> bits;
+        w1 = w2 >> bits;
+        w0 = w1 >> bits;
+    }
+    if constexpr (align == 8)
+    {
+        w7 = fshift(b, a, s2);
+        w6 = w7 >> bits;
+        w5 = w6 >> bits;
+        w4 = w5 >> bits;
+        w3 = w4 >> bits;
+        w2 = w3 >> bits;
+        w1 = w2 >> bits;
+        w0 = w1 >> bits;
+    }
+    half2 d0d1 = decode_3inst_2(w0 & 0xffff, w1 & 0xffff);
+    half2 d2d3 = decode_3inst_2(w2 & 0xffff, w3 & 0xffff);
+    half2 d4d5 = decode_3inst_2(w4 & 0xffff, w5 & 0xffff);
+    half2 d6d7 = decode_3inst_2(w6 & 0xffff, w7 & 0xffff);
+    frag0[0] = d0d1;
+    frag0[1] = d2d3;
+    frag1[0] = d4d5;
+    frag1[1] = d6d7;
+}
+
+__device__ __forceinline__ void dq8_aligned_4bits(const uint32_t* ptr, int t_offset, FragB& frag0, FragB& frag1)
+{
+    int i1 = t_offset / 8;
+    int i0 = (i1 + 31) % 32;
+
+    uint32_t a = ptr[i0];
+    uint32_t b = ptr[i1];
+    uint32_t w7 = b & 0xffff;
+    uint32_t w6 = (b >> 4) & 0xffff;
+    uint32_t w5 = (b >> 8) & 0xffff;
+    uint32_t w4 = (b >> 12) & 0xffff;
+    uint32_t w3 = (b >> 16) & 0xffff;
+    uint32_t w2 = __funnelshift_r(b, a, 20);
+    uint32_t w1 = w2 >> 4;
+    uint32_t w0 = w2 >> 8;
+    w2 = w2 & 0xffff;
+    w1 = w1 & 0xffff;
+    w0 = w0 & 0xffff;
+    half2 d0d1 = decode_3inst_2(w0, w1);
+    half2 d2d3 = decode_3inst_2(w2, w3);
+    half2 d4d5 = decode_3inst_2(w4, w5);
+    half2 d6d7 = decode_3inst_2(w6, w7);
+    frag0[0] = d0d1;
+    frag0[1] = d2d3;
+    frag1[0] = d4d5;
+    frag1[1] = d6d7;
+}
+
+__device__ __forceinline__ void dq8_aligned_4bits_bfe(const uint32_t* ptr, int t_offset, FragB& frag0, FragB& frag1)
+{
+    int i1 = t_offset / 8;
+    int i0 = (i1 + 31) % 32;
+
+    uint32_t a = ptr[i0];
+    uint32_t b = ptr[i1];
+    uint32_t w7 = bfe64(b, a, 0, 16);
+    uint32_t w6 = bfe64(b, a, 4, 16);
+    uint32_t w5 = bfe64(b, a, 8, 16);
+    uint32_t w4 = bfe64(b, a, 12, 16);
+    uint32_t w3 = bfe64(b, a, 16, 16);
+    uint32_t w2 = bfe64(b, a, 20, 16);
+    uint32_t w1 = bfe64(b, a, 24, 16);
+    uint32_t w0 = bfe64(b, a, 28, 16);
+    half2 d0d1 = decode_3inst_2(w0, w1);
+    half2 d2d3 = decode_3inst_2(w2, w3);
+    half2 d4d5 = decode_3inst_2(w4, w5);
+    half2 d6d7 = decode_3inst_2(w6, w7);
+    frag0[0] = d0d1;
+    frag0[1] = d2d3;
+    frag1[0] = d4d5;
+    frag1[1] = d6d7;
+}
+
+template <int bits>
+__device__ __forceinline__ void dq_dispatch(const uint32_t* ptr, int idx, FragB& frag0, FragB& frag1)
+{
+    if constexpr (bits == 1)
+    {
+        dq8<bits, 4>(ptr, idx, frag0, frag1);
+    }
+    else if constexpr (bits == 2)
+    {
+        dq8<bits, 4>(ptr, idx, frag0, frag1);
+    }
+    else if constexpr (bits == 3)
+    {
+        dq8<bits, 2>(ptr, idx, frag0, frag1);
+    }
+    else if constexpr (bits == 4)
+    {
+        dq8_aligned_4bits(ptr, idx, frag0, frag1);
+    }
+    else if constexpr (bits == 5)
+    {
+        dq4<bits>(ptr, idx, frag0);
+        dq4<bits>(ptr, idx + 4, frag1);
+    }
+    else if constexpr (bits == 6)
+    {
+        dq4<bits>(ptr, idx, frag0);
+        dq4<bits>(ptr, idx + 4, frag1);
+    }
+    else if constexpr (bits == 7)
+    {
+        dq2x2<bits>(ptr, idx, frag0);
+        dq2x2<bits>(ptr, idx + 4, frag1);
+    }
+    else if constexpr (bits == 8)
+    {
+        dq4<bits>(ptr, idx, frag0);
+        dq4<bits>(ptr, idx + 4, frag1);
+    }
+}
--- a/exllamav3/exllamav3_ext/quant/exl3_gemm.cu
+++ b/exllamav3/exllamav3_ext/quant/exl3_gemm.cu
@@ -0,0 +1,293 @@
+#include "exl3_gemm.cuh"
+
+#include <c10/cuda/CUDAGuard.h>
+#include <ATen/cuda/CUDAContext.h>
+#include <cuda_fp16.h>
+#include "../util.h"
+#include "../util.cuh"
+#include "../ptx.cuh"
+#include <tuple>
+#include <mutex>
+#include "exl3_dq.cuh"
+#include "hadamard.cuh"
+
+// Constants
+#define NUM_THREADS 256
+#define SMEM_MAX (90 * 1024)  // max shared memory on compute capability 8.6
+
+// Max allowable output size, in tiles. Used to allocate global lock buffer per device for sync across threadblocks
+#define MAX_TILES_C (1024 * 1024)
+
+#include "exl3_gemm_kernel.cuh"
+
+// Singleton to manage context for each device. Stores device attributes and a large-enough lock tensor per device
+
+#define MAX_DEVICES 32
+#define CC_OLD        1
+#define CC_AMPERE     2
+#define CC_ADA        3
+#define CC_HOPPER     4
+#define CC_BLACKWELL  5
+
+class DevCtx
+{
+private:
+    int num_sms[MAX_DEVICES] = {};
+    int cc[MAX_DEVICES] = {};
+    void* locks[MAX_DEVICES] = {};
+    std::mutex mtx;
+
+public:
+    static DevCtx& instance()
+    {
+        static DevCtx ctx;
+        return ctx;
+    }
+
+    int get_num_sms(int device)
+    {
+        std::lock_guard<std::mutex> lock(mtx);
+        if (!num_sms[device])
+            cuda_check(cudaDeviceGetAttribute(&num_sms[device], cudaDevAttrMultiProcessorCount, device));
+        return num_sms[device];
+    }
+
+    int get_cc(int device)
+    {
+        std::lock_guard<std::mutex> lock(mtx);
+        if (!cc[device])
+        {
+            cudaDeviceProp prop;
+            cuda_check(cudaGetDeviceProperties(&prop, device));
+            if (prop.major >= 10) cc[device] = CC_BLACKWELL;
+            else if (prop.major >= 9) cc[device] = CC_HOPPER;
+            else if (prop.major >= 8 && prop.minor >= 9) cc[device] = CC_ADA;
+            else if (prop.major >= 8 && prop.minor >= 6) cc[device] = CC_AMPERE;
+            else cc[device] = CC_OLD;
+        }
+        return cc[device];
+    }
+
+    int* get_locks(int device)
+    {
+        std::lock_guard<std::mutex> lock(mtx);
+        if (!locks[device])
+        {
+            cudaSetDevice(device);
+            cudaMalloc(&locks[device], MAX_TILES_C * sizeof(int));
+            cudaMemset(locks[device], 0, MAX_TILES_C * sizeof(int));
+        }
+        return (int*) locks[device];
+    }
+
+private:
+    DevCtx() = default;
+    DevCtx(const DevCtx&) = delete;
+    DevCtx& operator=(const DevCtx&) = delete;
+};
+
+// Kernel wrapper for bitrates 1..8
+
+template
+<
+    int TILESIZE_M,
+    int TILESIZE_K,
+    int TILESIZE_N,
+    int SH_STAGES,
+    int FRAG_STAGES
+>
+bool launch
+(
+    int K,
+    int num_sms,
+    const half* A_ptr,
+    const uint16_t* B_ptr,
+    half* C_ptr,
+    int size_m,
+    int size_k,
+    int size_n,
+    int* locks,
+    const uint16_t* sv_ptr,
+    cudaStream_t stream
+)
+{
+    if (size_k % TILESIZE_K != 0) return false;
+    if (size_n % TILESIZE_N != 0) return false;
+
+    int max_slices = size_k / TILESIZE_K * size_n / TILESIZE_N / 12;  // decided experimentally, TODO: maybe test more
+    num_sms = MIN(max_slices, num_sms);  // avoid empty blocks
+    int tiles_m = CEIL_DIVIDE(size_m, TILESIZE_M);
+    dim3 blocks(num_sms, tiles_m);
+
+    bool launch_ok = false;
+    static_for_pack<0x01, 0x02, 0x03, 0x04, 0x05, 0x06, 0x07, 0x08, 0x11, 0x12, 0x13, 0x14, 0x15, 0x16, 0x17, 0x18>
+    ([&](auto ic)
+    {
+        constexpr int i = decltype(ic)::value;
+        constexpr int i_b = i & 0x0f;
+        constexpr bool i_h = i & 0x10;
+
+        if (K == i_b && (sv_ptr != nullptr) == i_h)
+        {
+            cudaFuncSetAttribute
+            (
+                exl3_gemm_kernel<i_b, i_h, TILESIZE_M, TILESIZE_K, TILESIZE_N, SH_STAGES, FRAG_STAGES>,
+                cudaFuncAttributeMaxDynamicSharedMemorySize,
+                SMEM_MAX
+            );
+            exl3_gemm_kernel<i_b, i_h, TILESIZE_M, TILESIZE_K, TILESIZE_N, SH_STAGES, FRAG_STAGES>
+            <<<blocks, NUM_THREADS * TILESIZE_K / 16, SMEM_MAX, stream>>>
+            (
+                A_ptr,
+                B_ptr,
+                C_ptr,
+                size_m,
+                size_k,
+                size_n,
+                locks,
+                sv_ptr
+            );
+            cuda_check(cudaPeekAtLastError());
+            launch_ok = true;
+        }
+    });
+    return launch_ok;
+};
+
+int select_kernel(int cc, int size_m, int size_k, int size_n, const uint16_t* sv_ptr);
+
+/*
+
+EXL3 matmul, A @ B -> C
+
+- A: row-major A tensor, shape (m, k), dtype float16, contiguous
+- B: EXL3-quantized B tensor, shape (k//16, n//16, 16*bits), dtype uint16
+- C: empty row-major C tensor, shape (m, n), dtype float16, contiguous. Does not need to be zero-initialized
+- sv: optional, packed output sign flips, shape (n//16), dtype uint16
+
+If temp_A == A and su is not None, input transform is done in-place. EXL3 tensors quantized with the same H (e.g.
+Q, K, V projections in normal transformer) will have the same input sign flips.
+
+limitations:
+- k % 16 == 0
+- k % 128 == 0
+
+*/
+int exl3_gemm
+(
+    const at::Tensor& A,
+    const at::Tensor& B,
+    at::Tensor& C,
+    const c10::optional<at::Tensor>& sv,
+    int force_kernel_idx
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(A.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    TORCH_CHECK_DIM(B, 3);
+    TORCH_CHECK_SHAPES(A, 1, B, 0, 16);
+    TORCH_CHECK_SHAPES(C, 1, B, 1, 16);
+    TORCH_CHECK_SHAPES(A, 0, C, 0, 1);
+    TORCH_CHECK_DTYPE(A, kHalf);
+    TORCH_CHECK_DTYPE(B, kShort);
+    TORCH_CHECK_DTYPE(C, kHalf);
+
+    // TODO: Input scale here to reduce Python overhead?
+
+    // Get SV, optionally
+    const uint16_t* sv_ptr = (const uint16_t*) OPTPTR(sv);
+    if (sv_ptr) TORCH_CHECK_SHAPES(sv.value(), 0, B, 1, 1);
+
+    // Device properties
+    int device;
+    cudaGetDevice(&device);
+    int num_sms = DevCtx::instance().get_num_sms(device);
+    int cc = DevCtx::instance().get_cc(device);
+    int* locks = DevCtx::instance().get_locks(device);
+
+    // Dispatch
+    int bits = B.size(2) / 16;
+    const half* A_ptr = (const half*) A.data_ptr();
+    const uint16_t* B_ptr = (const uint16_t*) B.data_ptr();
+    half* C_ptr = (half*) C.data_ptr();
+    int size_m = A.size(0);
+    int size_k = A.size(1);
+    int size_n = B.size(1) * 16;
+
+    int selected_kernel;
+    if (force_kernel_idx <= 0)
+        selected_kernel = select_kernel(cc, size_m, size_k, size_n, sv_ptr);
+    else
+        selected_kernel = force_kernel_idx;
+
+    if (!selected_kernel)
+        TORCH_CHECK(false, "exl3_gemm: no compatible kernel");
+
+    bool launched;
+    #define ARGS bits, num_sms, A_ptr, B_ptr, C_ptr, size_m, size_k, size_n, locks, sv_ptr, stream
+    switch (selected_kernel)
+    {
+        //                         tsz_m   tsz_k   tsz_n  sh_st  fr_st  fuse_h
+        case 1: launched = launch<    16,     16,    128,     3,     1>(ARGS); break;
+        case 2: launched = launch<    16,     32,    256,     4,     2>(ARGS); break;
+        case 3: launched = launch<    16,     32,    128,     4,     2>(ARGS); break;
+        case 4: launched = launch<    32,     32,    128,     4,     2>(ARGS); break;
+        case 5: launched = launch<    64,     16,    128,     4,     2>(ARGS); break;
+        case 6: launched = launch<    16,     16,    512,     4,     2>(ARGS); break;
+         default:
+            launched = false;
+            break;
+    }
+
+    return launched ? selected_kernel : 0;
+}
+
+int exl3_gemm_num_kernel_variants()
+{
+    return 6;
+}
+
+// Select kernel based on tensor shape and device props
+
+int select_kernel(int cc, int size_m, int size_k, int size_n, const uint16_t* sv_ptr)
+{
+    bool mod_256 = (size_n % 256 == 0);
+    bool mod_512 = (size_n % 512 == 0);
+
+    switch(cc)
+    {
+        case CC_OLD:
+        case CC_AMPERE:
+            if (size_m > 16) return 4;
+            if (size_n < 2048) return 3;
+            return mod_256 ? 2 : 3;
+
+        case CC_ADA:
+            if (size_m <= 16)
+            {
+                if (size_k * size_n >= 5e7) return mod_256 ? 2 : 3;
+                return 3;
+            }
+            if (size_n * size_k < 8e6) return mod_256 ? 2 : 4;
+            return 4;
+
+        case CC_HOPPER:
+        case CC_BLACKWELL:
+            if (size_m <= 16)
+            {
+                if (size_n >= 65536 && mod_512) return 6;
+                if (size_k * size_n >= 2e8 && mod_512) return 6;
+                if (size_k * size_n >= 5e7) return mod_256 ? 2 : 3;
+                return 3;
+            }
+            if (size_m > 32)
+            {
+                if (size_n * size_k < 8e6) return 4;
+                return 5;
+            }
+            if (size_n * size_k < 8e6) return 3;
+            return 4;
+    }
+    return 0;
+}
--- a/exllamav3/exllamav3_ext/quant/exl3_gemm.cuh
+++ b/exllamav3/exllamav3_ext/quant/exl3_gemm.cuh
@@ -0,0 +1,14 @@
+#pragma once
+
+#include <ATen/Tensor.h>
+
+int exl3_gemm
+(
+    const at::Tensor& A,
+    const at::Tensor& B,
+    at::Tensor& C,
+    const c10::optional<at::Tensor>& sv,
+    int force_kernel_idx
+);
+
+int exl3_gemm_num_kernel_variants();
--- a/exllamav3/exllamav3_ext/quant/exl3_gemm_kernel.cuh
+++ b/exllamav3/exllamav3_ext/quant/exl3_gemm_kernel.cuh
@@ -0,0 +1,620 @@
+
+template
+<
+    int bits,
+    bool output_had,
+    int TILESIZE_M,
+    int TILESIZE_K,
+    int TILESIZE_N,
+    int SH_STAGES,
+    int FRAG_STAGES
+>
+__global__ __launch_bounds__(NUM_THREADS * TILESIZE_K / 16)
+void exl3_gemm_kernel
+(
+    const half* __restrict__  A,
+    const uint16_t* __restrict__ B,
+    half* __restrict__ C,
+    int size_m,
+    int size_k,
+    int size_n,
+    int* __restrict__ locks,
+    const uint16_t* __restrict__ sv
+)
+{
+    const int TILEBLOCKS_M = TILESIZE_M / 16;
+    const int TILEBLOCKS_K = TILESIZE_K / 16;
+    const int TILEBLOCKS_N = TILESIZE_N / 16;
+    const int FRAGS_M = TILEBLOCKS_M;
+    const int FRAGS_K = TILEBLOCKS_K;
+    const int FRAGS_N_PER_WARP = 2 * TILEBLOCKS_N / (NUM_THREADS / 32);
+
+    const int sh_a_stage_size = TILESIZE_M * TILESIZE_K;                         // in halfs
+    const int sh_b_stage_size = TILEBLOCKS_K * TILEBLOCKS_N * 256 / 16 * bits;   // in uint16s
+    const int sh_c_size = 4 * NUM_THREADS;                                       // in floats
+    // TODO: Maybe flush global->shared pipeline before reduction step so sh_c can share memory with sh_a and sh_b
+
+    // Sanity checks
+    static_assert(NUM_THREADS == 256);
+    static_assert(TILESIZE_M % 16 == 0, "Invalid kernel params");
+    static_assert(TILESIZE_K % 16 == 0, "Invalid kernel params");
+    static_assert(TILESIZE_N % 128 == 0, "Invalid kernel params");
+    static_assert
+    (
+        SMEM_MAX >= SH_STAGES * (2 * sh_a_stage_size + 2 * sh_b_stage_size) + 4 * sh_c_size,
+        "Invalid kernel params (insufficient shared memory for shape)"
+    );
+
+    // Shared memory
+    extern __shared__ half shared[];
+    half* sh_a = shared;
+    uint16_t* sh_b = (uint16_t*) (sh_a + SH_STAGES * sh_a_stage_size);
+    float* sh_c = (float*) (sh_b + sh_b_stage_size * SH_STAGES);
+
+    // Thread index
+    int t = threadIdx.x % NUM_THREADS;
+    int sub_k = threadIdx.x / NUM_THREADS;
+    int warp_id = t / 32;
+    int lane_id = t % 32;
+
+    // Dimensions
+    Dim3 size = { size_m, size_k, size_n };
+    Dim3 tiles = { CEIL_DIVIDE(size_m, TILESIZE_M), size_k / TILESIZE_K, size_n / TILESIZE_N };
+    Dim3 blocks = { 1, tiles.k * TILEBLOCKS_K, tiles.n * TILEBLOCKS_N };
+
+    // Start and end index of current slice, must span at least one tile
+    int num_slices = gridDim.x;
+    int slice_beg = tiles.numel_b() * blockIdx.x / num_slices;
+    int slice_end = tiles.numel_b() * (blockIdx.x + 1) / num_slices;
+    int slice_len = slice_end - slice_beg;
+    if (slice_len < 1) return;
+
+    auto index_m = [&] (int slice_i) { return blockIdx.y; };
+    auto index_k = [&] (int slice_i) { return (slice_i % tiles.k); };
+    auto index_n = [&] (int slice_i) { return (slice_i / tiles.k); };
+
+    // Batch dimension
+    int slice_m = index_m(slice_beg);
+    int max_m = MIN(size_m - slice_m * TILESIZE_M, TILESIZE_M);
+
+    // Pipe 0, global A, B tile and shared A, B tile
+    int slice0_k = index_k(slice_beg);
+    int slice0_n = index_n(slice_beg);
+    int slice0_iters = slice_len;
+
+    int gl_a_stride_m = TILESIZE_M * size_k;
+    const int gl_a_stride_k = TILESIZE_K;
+    const int sh0_a_stride_m = TILESIZE_M * TILESIZE_K;
+    const int sh0_a_stride_k = TILESIZE_K;
+    const half* gl_a_ptr = A + slice_m * gl_a_stride_m + slice0_k * gl_a_stride_k;
+    half* sh0_a_ptr = sh_a + (slice0_iters % SH_STAGES) * sh_a_stage_size;
+
+    const int load_a_iters = CEIL_DIVIDE(sh0_a_stride_m / 8, NUM_THREADS);
+    bool pred_a_gl[load_a_iters];
+    int load_a_gl[load_a_iters];
+    for (int i = 0; i < load_a_iters; ++i)
+    {
+        int k = (i * NUM_THREADS + t) % (gl_a_stride_k / 8);
+        int m = (i * NUM_THREADS + t) / (gl_a_stride_k / 8);
+        load_a_gl[i] = m * size_k / 8 + k;
+        pred_a_gl[i] = m < max_m;
+    }
+
+    int gl_b_stride_k = blocks.n * TILEBLOCKS_K * 256 / 16 * bits;
+    const int gl_b_stride_n = TILEBLOCKS_N * 256 / 16 * bits;
+    const int sh0_b_stride_k = TILEBLOCKS_K * TILEBLOCKS_N * 256 / 16 * bits;
+    const int sh0_b_stride_n = TILEBLOCKS_N * 256 / 16 * bits;
+    const uint16_t* gl_b_ptr = B + slice0_k * gl_b_stride_k + slice0_n * gl_b_stride_n;
+    uint16_t* sh0_b_ptr = sh_b + (slice0_iters % SH_STAGES) * sh_b_stage_size;
+
+    const int load_b_iters = CEIL_DIVIDE(sh0_b_stride_k / 8, NUM_THREADS);
+    bool pred_b_gl[load_b_iters];
+    int load_b_gl[load_b_iters];
+    for (int i = 0; i < load_b_iters; ++i)
+    {
+        int n = (i * NUM_THREADS + t) % (gl_b_stride_n / 8);
+        int k = (i * NUM_THREADS + t) / (gl_b_stride_n / 8);
+        load_b_gl[i] = k * blocks.n * 256 / 16 * bits / 8 * k + n;
+        pred_b_gl[i] = i * NUM_THREADS + t < sh0_b_stride_k / 8;
+    }
+
+    auto advance0 = [&] ()
+    {
+        slice0_k++;
+        slice0_iters--;
+
+        int stage = slice0_iters % SH_STAGES;
+        sh0_a_ptr = sh_a + stage * sh_a_stage_size;
+        sh0_b_ptr = sh_b + stage * sh_b_stage_size;
+
+        if (slice0_k >= tiles.k)
+        {
+            slice0_k = 0;
+            slice0_n++;
+            gl_a_ptr = A + slice_m * gl_a_stride_m + slice0_k * gl_a_stride_k;
+            gl_b_ptr = B + slice0_k * gl_b_stride_k + slice0_n * gl_b_stride_n;
+        }
+        else
+        {
+            gl_a_ptr += gl_a_stride_k;
+            gl_b_ptr += gl_b_stride_k;
+        }
+    };
+
+    // Pipe 1, shared A, B tile and registers
+    int slice1_k = slice0_k;
+    int slice1_n = slice0_n;
+    int slice1_iters = slice0_iters;
+
+    half* sh1_a_ptr = sh_a + (slice1_iters % SH_STAGES) * sh_a_stage_size;
+    uint16_t* sh1_b_ptr = sh_b + (slice1_iters % SH_STAGES) * sh_b_stage_size;
+
+    auto advance1 = [&] ()
+    {
+        slice1_k++;
+        slice1_iters--;
+
+        int stage = slice1_iters % SH_STAGES;
+        sh1_a_ptr = sh_a + stage * sh_a_stage_size;
+        sh1_b_ptr = sh_b + stage * sh_b_stage_size;
+
+        if (slice1_k >= tiles.k)
+        {
+            slice1_k = 0;
+            slice1_n++;
+        }
+    };
+
+    // Pipe 2
+    int slice2_k = slice0_k;
+    int slice2_k0 = slice0_k;
+    int slice2_n = slice0_n;
+    int slice2_iters = slice0_iters;
+
+    int gl_c_stride_n = TILESIZE_N;
+    int gl_c_stride_m = TILESIZE_M * size_n;
+    half* gl_c_ptr = C + slice_m * gl_c_stride_m + slice2_n * gl_c_stride_n;
+
+    register FragA frag_a[FRAG_STAGES][FRAGS_M];
+    register FragB frag_b[FRAG_STAGES][FRAGS_N_PER_WARP];
+    register FragC frag_c[FRAGS_M][FRAGS_N_PER_WARP];
+
+    auto advance2 = [&] ()
+    {
+        slice2_k++;
+        slice2_iters--;
+
+        if (slice2_k >= tiles.k)
+        {
+            slice2_k = 0;
+            slice2_k0 = 0;
+            slice2_n++;
+            gl_c_ptr += gl_c_stride_n;
+        }
+    };
+
+    // Schedule load of the next A, B tiles to shared memory and advance the pipeline
+    auto async_load_gl = [&] ()
+    {
+        if (sub_k)
+        {
+            cp_async_fence();
+            return;
+        }
+
+        if (slice0_iters)
+        {
+            // Copy tile from row-major A matrix
+            {
+                const int4* gl = (const int4*) gl_a_ptr;
+                int4* sh = (int4*) sh0_a_ptr;
+                #pragma unroll
+                for (int i = 0; i < load_a_iters; ++i)
+                {
+                    // TODO: Rearrange into ldmatrix friendly layout while loading?
+                    // @p seems to crash on Blackwell but does not perform better on Ampere and Ada anyway
+                    // cp_async_pred(sh + NUM_THREADS * i + t, gl + load_a_gl[i], pred_a_gl[i]);
+                    if (pred_a_gl[i]) cp_async(sh + NUM_THREADS * i + t, gl + load_a_gl[i]);
+                }
+            }
+
+            // Copy tile of 256-element blocks from quantized B matrix
+            {
+                const int4* gl = (const int4*) gl_b_ptr;
+                int4* sh = (int4*) sh0_b_ptr;
+                #pragma unroll
+                for (int i = 0; i < load_b_iters; ++i)
+                {
+                    // @p seems to crash on Blackwell but does not perform better on Ampere and Ada anyway
+                    // cp_async_pred(sh + NUM_THREADS * i + t, gl + load_b_gl[i], pred_b_gl[i]);
+                    if (pred_b_gl[i]) cp_async(sh + NUM_THREADS * i + t, gl + load_b_gl[i]);
+                }
+            }
+            advance0();
+        }
+
+        // Sync and advance
+        cp_async_fence();
+    };
+
+    // Load fragments
+    // Ref. for fragment layout:
+    // https://docs.nvidia.com/cuda/parallel-thread-execution/index.html#matrix-fragments-for-mma-m16n8k16-with-floating-point-type
+    auto load_frags = [&] (int buf)
+    {
+        if (!slice1_iters) return;
+
+        // A fragments
+        {
+            // TODO: Resolve bank conflicts
+            int r = (lane_id % 8) + 8 * ((lane_id / 8) % 2);
+            int c = lane_id / 16;
+            int4* sha = (int4*) sh1_a_ptr + r * TILESIZE_K / 8 + c;
+            #pragma unroll
+            for (int m = 0; m < TILEBLOCKS_M; ++m)
+                ldsm4(frag_a[buf][m], sha + (m * 16) * TILESIZE_K / 8 + sub_k * 16 / 8);
+        }
+
+        // B fragments
+        int r0 = lane_id / 2;
+        int c0 = (lane_id % 2) * 8;
+
+        #pragma unroll
+        for (int n2 = 0; n2 < FRAGS_N_PER_WARP; n2 += 2)
+        {
+            int sub_n2 = warp_id * FRAGS_N_PER_WARP / 2 + n2 / 2;
+            const uint32_t* shb = (const uint32_t*) (sh1_b_ptr + (sub_k * TILEBLOCKS_N + sub_n2) * 256 / 16 * bits);
+
+            dq_dispatch<bits>(shb, r0 * 16 + c0, frag_b[buf][n2], frag_b[buf][n2 + 1]);
+        }
+
+        __syncthreads();
+        advance1();
+    };
+
+    // Clear C fragments
+    auto clear_frag_c = [&] ()
+    {
+        #pragma unroll
+        for (int m = 0; m < FRAGS_M; ++m)
+            #pragma unroll
+            for (int n = 0; n < FRAGS_N_PER_WARP; ++n)
+                frag_c[m][n] = {};
+    };
+
+    // Threadblock reduction
+    auto threadblock_reduce = [&] ()
+    {
+        auto store = [&] (int i, int m, int n)
+        {
+            // TODO: Shuffle to avoid bank conflicts here? Doesn't seem to be a bottleneck
+            // TODO: Always accumulates entire C fragment, could be limited when size_m < 16
+            if (sub_k == i)
+            {
+                float* sh_red = sh_c + (FRAGS_N_PER_WARP * 4) * t;
+                for (int i = 0; i < 4; ++i)
+                    *sh_red++ = frag_c[m][n][i];
+            }
+            __syncthreads();
+        };
+
+        auto add = [&] (int i, int m, int n)
+        {
+            if (sub_k == i)
+            {
+                float* sh_red = sh_c + (FRAGS_N_PER_WARP * 4) * t;
+                for (int i = 0; i < 4; ++i)
+                    frag_c[m][n][i] += *sh_red++;
+            }
+            __syncthreads();
+        };
+
+        for (int m = 0; m < FRAGS_M; ++m)
+        for (int n = 0; n < FRAGS_N_PER_WARP; ++n)
+        {
+            if constexpr (TILEBLOCKS_K == 2)
+            {
+                store(1, m, n);
+                add(0, m, n);
+            }
+            if constexpr (TILEBLOCKS_K == 3)
+            {
+                store(1, m, n);
+                add(0, m, n);
+                store(2, m, n);
+                add(0, m, n);
+            }
+            if constexpr (TILEBLOCKS_K == 4)
+            {
+                store(3, m, n);
+                add(2, m, n);
+                store(1, m, n);
+                add(0, m, n);
+                store(2, m, n);
+                add(0, m, n);
+            }
+        }
+    };
+
+    // Output hadamard transform
+    auto apply_output_had = [&] ()
+    {
+        auto shuffle_had_fx32 = [&](float v)
+        {
+            for (int i = 1; i < 32; i <<= 1)
+            {
+                float pv = __shfl_xor_sync(0xffffffff, v, i);
+                uint32_t* vi = reinterpret_cast<uint32_t*>(&v);
+                int32_t sfm = -static_cast<int16_t>(lane_id & i) >> 31;
+                *vi ^= (sfm & 0x80000000);
+                v = v + pv;
+            }
+            return v;
+        };
+
+        // Operates on 1x128 tiles
+        int n_tiles = TILESIZE_N / 128;
+        int m_tiles = max_m;
+        int num_tiles = n_tiles * m_tiles;
+
+        // One warp per tile
+        int tile_idx = threadIdx.x / 32;
+        int tile_stride = blockDim.x / 32;
+
+        while (tile_idx < num_tiles)
+        {
+            // Offset of tile slice
+            int tile_i_x = tile_idx % n_tiles;
+            int tile_i_y = tile_idx / n_tiles;
+            half4* c_ptr = (half4*) (gl_c_ptr + size_n * tile_i_y + 128 * tile_i_x);
+
+            // Load
+            half4 v = c_ptr[lane_id];
+
+            // 4 element had
+            float v0 = __half2float(__low2half(v.x));
+            float v1 = __half2float(__high2half(v.x));
+            float v2 = __half2float(__low2half(v.y));
+            float v3 = __half2float(__high2half(v.y));
+            float h0 = v0 + v1 + v2 + v3;
+            float h1 = v0 - v1 + v2 - v3;
+            float h2 = v0 + v1 - v2 - v3;
+            float h3 = v0 - v1 - v2 + v3;
+
+            // 32 element had, warp shuffle
+            h0 = shuffle_had_fx32(h0);
+            h1 = shuffle_had_fx32(h1);
+            h2 = shuffle_had_fx32(h2);
+            h3 = shuffle_had_fx32(h3);
+            h0 *= 0.088388347648f;  // 1/sqrt(128)
+            h1 *= 0.088388347648f;
+            h2 *= 0.088388347648f;
+            h3 *= 0.088388347648f;
+            v.x = __floats2half2_rn(h0, h1);
+            v.y = __floats2half2_rn(h2, h3);
+
+            // Sign flips
+            int i = (TILESIZE_N / 128 * slice2_n + tile_i_x) * 32 + lane_id;
+            uint32_t signs = (uint32_t) (sv[i / 4] >> ((i % 4) * 4));  // TODO: preload to smem (if bottleneck)
+            v.x = h2xor(v.x, ((signs & 1) << 15) | ((signs & 2) << 30));  // TODO: pre-unpack (if bottleneck)
+            v.y = h2xor(v.y, ((signs & 4) << 13) | ((signs & 8) << 28));
+
+            // Store
+            c_ptr[lane_id] = v;
+
+            // Advance
+            tile_idx += tile_stride;
+        }
+    };
+
+    // Output reduction
+    auto reduce = [&] ()
+    {
+        // First reduce all partial sums along k for the current slice
+        threadblock_reduce();
+
+        // Process (partial) slices within column in reverse order so the threadblock doing the bottom slice is
+        // free to proceed to the next column right away
+        int lock_i = tiles.k - slice2_k - 1;
+        int lock_d = slice2_k - slice2_k0 + 1;
+        int* lock = &locks[slice_m * blocks.n + slice2_n];
+
+        barrier_acquire(lock, lock_i);
+
+        bool first = lock_i == 0;
+        bool last = lock_i + lock_d == tiles.k;
+
+        int n0 = warp_id * FRAGS_N_PER_WARP;
+
+        // Second and subsequent threadblocks in column read back the intermediate sum from global memory
+        // TODO: Use an intermediate layout to make these writes coalesce
+        if (!sub_k && !first)
+        {
+            for (int n = 0; n < FRAGS_N_PER_WARP; ++n)
+            {
+                for (int m = 0; m < FRAGS_M; ++m)
+                {
+                    int r0 = lane_id / 4 + 16 * m;
+                    int r1 = r0 + 8;
+                    int c = (lane_id % 4) * 2;
+                    if (r0 < max_m)
+                    {
+                        half2* c_ptr = (half2*) (gl_c_ptr + r0 * size_n + (n0 + n) * 8 + c);
+                        float2 interm = __half22float2(*c_ptr);
+                        frag_c[m][n][0] += interm.x;
+                        frag_c[m][n][1] += interm.y;
+                    }
+                    if (r1 < max_m)
+                    {
+                        half2* c_ptr = (half2*) (gl_c_ptr + r1 * size_n + (n0 + n) * 8 + c);
+                        float2 interm = __half22float2(*c_ptr);
+                        frag_c[m][n][2] += interm.x;
+                        frag_c[m][n][3] += interm.y;
+                    }
+                }
+            }
+        }
+
+        // All but last threadblock in column threadblocks write the intermediate result to global memory
+        if (!sub_k && !last)
+        {
+            for (int n = 0; n < FRAGS_N_PER_WARP; ++n)
+            {
+                for (int m = 0; m < FRAGS_M; ++m)
+                {
+                    int r0 = lane_id / 4 + 16 * m;
+                    int r1 = r0 + 8;
+                    int c = (lane_id % 4) * 2;
+                    if (r0 < max_m)
+                    {
+                        half2* c_ptr = (half2*) (gl_c_ptr + r0 * size_n + (n0 + n) * 8 + c);
+                        half2 sum = __floats2half2_rn(frag_c[m][n][0], frag_c[m][n][1]);
+                        *c_ptr = sum;
+                    }
+                    if (r1 < max_m)
+                    {
+                        half2* c_ptr = (half2*) (gl_c_ptr + r1 * size_n + (n0 + n) * 8 + c);
+                        half2 sum = __floats2half2_rn(frag_c[m][n][2], frag_c[m][n][3]);
+                        *c_ptr = sum;
+                    }
+                }
+            }
+        }
+
+        // Last block writes in row-major format and performs output hadamard transform
+        if (!sub_k && last)
+        {
+            for (int n = 0; n < FRAGS_N_PER_WARP; ++n)
+            {
+                for (int m = 0; m < FRAGS_M; ++m)
+                {
+                    int r0 = lane_id / 4 + 16 * m;
+                    int r1 = r0 + 8;
+                    int c = (lane_id % 4) * 2;
+                    if (r0 < max_m)
+                    {
+                        half2* c_ptr = (half2*) (gl_c_ptr + r0 * size_n + (n0 + n) * 8 + c);
+                        half2 sum = __floats2half2_rn(frag_c[m][n][0], frag_c[m][n][1]);
+                        *c_ptr = sum;
+                    }
+                    if (r1 < max_m)
+                    {
+                        half2* c_ptr = (half2*) (gl_c_ptr + r1 * size_n + (n0 + n) * 8 + c);
+                        half2 sum = __floats2half2_rn(frag_c[m][n][2], frag_c[m][n][3]);
+                        *c_ptr = sum;
+                    }
+                }
+            }
+        }
+
+        // Last block also performs output hadamard (using all threads)
+        if (last)
+        {
+            // TODO: Determine if this is a bottleneck, could maybe be done in smem or registers
+            if constexpr (output_had)
+            {
+                __syncthreads();
+                apply_output_had();
+            }
+        }
+
+        barrier_release(lock, lock_d, last);
+
+        clear_frag_c();
+    };
+
+    // Wait until there are at most SH_STAGES - 2 async copies pending, i.e. at least one stage has finished loading
+    auto wait_stage = [&] ()
+    {
+        cp_async_wait<SH_STAGES - 2>();
+        __syncthreads();
+    };
+
+    // Perform tensor core matmul on current tile
+    auto matmul = [&] (int buf)
+    {
+        for (int m = 0; m < FRAGS_M; ++m)
+            for (int n = 0; n < FRAGS_N_PER_WARP; ++n)
+                ptx_mma_m16n8k16(frag_a[buf][m], frag_b[buf][n], frag_c[m][n]);
+    };
+
+    // Start global to shared pipeline
+    for (int i = 0; i < SH_STAGES - 1; ++i)
+        async_load_gl();
+    wait_stage();
+
+    // Start shared to register pipeline.
+    clear_frag_c();
+    if constexpr (FRAG_STAGES > 1)
+        load_frags(0);
+
+    // Main loop. Fragments are double buffered to allow more interleaving. This is especially important to hide the
+    // dequantization overhead, but we need two different iterations of the main loop to avoid confusing the compiler
+    // and making it (sometimes) place the fragment arrays in local memory
+
+    if constexpr (FRAG_STAGES == 1)
+    {
+        while (true)
+        {
+            async_load_gl();
+            wait_stage();
+            load_frags(0);
+            matmul(0);
+            if (slice2_k == tiles.k - 1 || slice2_iters == 1) { reduce(); slice2_k0 = slice2_k + 1; }
+            advance2();
+            if (!slice2_iters) break;
+        }
+    }
+
+    if constexpr (FRAG_STAGES == 2)
+    {
+        while (true)
+        {
+            async_load_gl();
+            wait_stage();
+            load_frags(1);
+            matmul(0);
+            if (slice2_k == tiles.k - 1 || slice2_iters == 1) { reduce(); slice2_k0 = slice2_k + 1; }
+            advance2();
+            if (!slice2_iters) break;
+
+            async_load_gl();
+            wait_stage();
+            load_frags(0);
+            matmul(1);
+            if (slice2_k == tiles.k - 1 || slice2_iters == 1) { reduce(); slice2_k0 = slice2_k + 1; }
+            advance2();
+            if (!slice2_iters) break;
+        }
+    }
+
+    if constexpr (FRAG_STAGES == 3)
+    {
+        while (true)
+        {
+            async_load_gl();
+            wait_stage();
+            load_frags(1);
+            matmul(0);
+            if (slice2_k == tiles.k - 1 || slice2_iters == 1) { reduce(); slice2_k0 = slice2_k + 1; }
+            advance2();
+            if (!slice2_iters) break;
+
+            async_load_gl();
+            wait_stage();
+            load_frags(2);
+            matmul(1);
+            if (slice2_k == tiles.k - 1 || slice2_iters == 1) { reduce(); slice2_k0 = slice2_k + 1; }
+            advance2();
+            if (!slice2_iters) break;
+
+            async_load_gl();
+            wait_stage();
+            load_frags(0);
+            matmul(2);
+            if (slice2_k == tiles.k - 1 || slice2_iters == 1) { reduce(); slice2_k0 = slice2_k + 1; }
+            advance2();
+            if (!slice2_iters) break;
+        }
+    }
+}
--- a/exllamav3/exllamav3_ext/quant/hadamard.cu
+++ b/exllamav3/exllamav3_ext/quant/hadamard.cu
@@ -0,0 +1,210 @@
+#include "quantize.cuh"
+#include <c10/cuda/CUDAGuard.h>
+#include <ATen/cuda/CUDAContext.h>
+#include <cuda_fp16.h>
+#include "../util.h"
+#include "../util.cuh"
+
+__device__ inline half hreduce(half2 x)
+{
+    return __hadd(__low2half(x), __high2half(x));
+}
+
+__device__ inline float shuffle_had_fx32(float v, int lane_id)
+{
+    for (int i = 1; i < 32; i <<= 1)
+    {
+        float pv = __shfl_xor_sync(0xffffffff, v, i);
+        uint32_t* vi = reinterpret_cast<uint32_t*>(&v);
+        int32_t sfm = -static_cast<int16_t>(lane_id & i) >> 31;
+        *vi ^= (sfm & 0x80000000);
+        v = v + pv;
+    }
+    return v;
+}
+
+__device__ inline half2 shuffle_had_h2x32(half2 v, int lane_id)
+{
+    for (int i = 1; i < 32; i <<= 1)
+    {
+        half2 pv = __shfl_xor_sync(0xffffffff, v, i);
+        uint32_t* vi = reinterpret_cast<uint32_t*>(&v);
+        int32_t sfm = -static_cast<int16_t>(lane_id & i) >> 31;
+        *vi ^= (sfm & 0x80008000);
+        v = __hadd2(v, pv);
+    }
+    return v;
+}
+
+__global__ __launch_bounds__(32)
+void hadh_r_128_kernel
+(
+    const half* __restrict__ input_ptr,
+    half* __restrict__ output_ptr,
+    const half* __restrict__ pre_scale,
+    const half* __restrict__ post_scale
+)
+{
+    int t = threadIdx.x;
+    input_ptr += gridDim.y * 128 * blockIdx.x + blockIdx.y * 128;
+    output_ptr += gridDim.y * 128 * blockIdx.x + blockIdx.y * 128;
+
+    // Load
+    half4 v = ((half4*) input_ptr)[t];
+
+    // Prescale
+    if (pre_scale)
+    {
+        pre_scale += blockIdx.y * 128;
+        half4 s = ((half4*) pre_scale)[t];
+        v.x = __h2div(v.x, s.x);
+        v.y = __h2div(v.y, s.y);
+    }
+
+    // 4 element had
+    half2 vxpp = v.x;
+    half2 vxpn = h2xor(vxpp, 0x80000000);
+    half2 vypp = v.y;
+    half2 vypn = h2xor(vypp, 0x80000000);
+    half h0 = hreduce(__hadd2(vxpp, vypp));
+    half h1 = hreduce(__hadd2(vxpn, vypn));
+    half h2 = hreduce(__hsub2(vxpp, vypp));
+    half h3 = hreduce(__hsub2(vxpn, vypn));
+    v.x = __halves2half2(h0, h1);
+    v.y = __halves2half2(h2, h3);
+
+    // 32 element had, warp shuffle
+    v.x = shuffle_had_h2x32(v.x, t);
+    v.y = shuffle_had_h2x32(v.y, t);
+
+    // Rescale by 1/sqrt(128)
+    half2 f = __halves2half2(__float2half_rn(0.088388347648), __float2half_rn(0.088388347648));
+    v.x = __hmul2(v.x, f);
+    v.y = __hmul2(v.y, f);
+
+    // Postscale
+    if (post_scale)
+    {
+        post_scale += blockIdx.y * 128;
+        half4 s = ((half4*) post_scale)[t];
+        v.x = __h2div(v.x, s.x);
+        v.y = __h2div(v.y, s.y);
+    }
+
+    // Store
+    ((half4*) output_ptr)[t] = v;
+}
+
+__global__ __launch_bounds__(32)
+void hadf_r_128_kernel
+(
+    const half* __restrict__ input_ptr,
+    half* __restrict__ output_ptr,
+//    const uint16_t* __restrict__ pre_flip,
+    const half* __restrict__ pre_scale,
+    const uint16_t* __restrict__ post_flip,
+    float r_scale
+)
+{
+    int t = threadIdx.x;
+    input_ptr += gridDim.y * 128 * blockIdx.x + blockIdx.y * 128;
+    output_ptr += gridDim.y * 128 * blockIdx.x + blockIdx.y * 128;
+
+    // Load
+    half4 v = ((half4*) input_ptr)[t];
+
+    // Pre flip
+//    if (pre_flip)
+//    {
+//        int i = blockIdx.y * 32 + t;
+//        uint32_t signs = (uint32_t) (pre_flip[i / 4] >> ((i % 4) * 4));
+//        v.x = h2xor(v.x, ((signs & 1) << 15) | ((signs & 2) << 30));
+//        v.y = h2xor(v.y, ((signs & 4) << 13) | ((signs & 8) << 28));
+//    }
+    __syncthreads();
+
+    if (pre_scale)
+    {
+        int i = blockIdx.y * 32 + t;
+        half4 scales = ((half4*) pre_scale)[i];
+        v.x = __hmul2(v.x, scales.x);
+        v.y = __hmul2(v.y, scales.y);
+    }
+    __syncthreads();
+
+    // 4 element had
+    float v0 = __half2float(__low2half(v.x));
+    float v1 = __half2float(__high2half(v.x));
+    float v2 = __half2float(__low2half(v.y));
+    float v3 = __half2float(__high2half(v.y));
+    float h0 = v0 + v1 + v2 + v3;
+    float h1 = v0 - v1 + v2 - v3;
+    float h2 = v0 + v1 - v2 - v3;
+    float h3 = v0 - v1 - v2 + v3;
+
+    // 32 element had, warp shuffle
+    h0 = shuffle_had_fx32(h0, t);
+    h1 = shuffle_had_fx32(h1, t);
+    h2 = shuffle_had_fx32(h2, t);
+    h3 = shuffle_had_fx32(h3, t);
+    h0 *= r_scale;
+    h1 *= r_scale;
+    h2 *= r_scale;
+    h3 *= r_scale;
+    v.x = __floats2half2_rn(h0, h1);
+    v.y = __floats2half2_rn(h2, h3);
+
+    // Post flip
+    if (post_flip)
+    {
+        int i = blockIdx.y * 32 + t;
+        uint32_t signs = (uint32_t) (post_flip[i / 4] >> ((i % 4) * 4));
+        v.x = h2xor(v.x, ((signs & 1) << 15) | ((signs & 2) << 30));
+        v.y = h2xor(v.y, ((signs & 4) << 13) | ((signs & 8) << 28));
+    }
+
+    // Store
+    ((half4*) output_ptr)[t] = v;
+}
+
+/*
+Compute y = (x.view(-1, 128) @ had_128).view(x.shape)
+Works inplace if y == x
+*/
+void had_r_128
+(
+    const at::Tensor& input,
+    const at::Tensor& output,
+//    const c10::optional<at::Tensor>& pre_flip,
+    const c10::optional<at::Tensor>& pre_scale,
+    const c10::optional<at::Tensor>& post_flip,
+    float scale
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(input.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    TORCH_CHECK_DIM(input, 2);
+    TORCH_CHECK_SHAPES_FULL(input, output);
+    TORCH_CHECK_DTYPE(input, kHalf);
+    TORCH_CHECK_DTYPE(output, kHalf);
+    TORCH_CHECK_DIV(input, 1, 128);
+
+    int rows = input.size(0);
+    int cols = input.size(1);
+    int blocks = cols / 128;
+    float r_scale = scale * 0.088388347648f; // scale / sqrt(128)
+
+    dim3 blockDim(32);
+    dim3 gridDim(rows, blocks);
+
+    hadf_r_128_kernel<<<gridDim, blockDim, 0, stream>>>
+    (
+        (const half*) input.data_ptr(),
+        (half*) output.data_ptr(),
+//        (const uint16_t*) OPTPTR(pre_flip),
+        (const half*) OPTPTR(pre_scale),
+        (const uint16_t*) OPTPTR(post_flip),
+        r_scale
+    );
+}
--- a/exllamav3/exllamav3_ext/quant/hadamard.cuh
+++ b/exllamav3/exllamav3_ext/quant/hadamard.cuh
@@ -0,0 +1,12 @@
+#pragma once
+
+#include <ATen/Tensor.h>
+
+void had_r_128
+(
+    const at::Tensor& input,
+    const at::Tensor& output,
+    const c10::optional<at::Tensor>& pre_flip,
+    const c10::optional<at::Tensor>& post_flip,
+    float scale
+);
--- a/exllamav3/exllamav3_ext/quant/pack.cu
+++ b/exllamav3/exllamav3_ext/quant/pack.cu
@@ -0,0 +1,220 @@
+#include "quantize.cuh"
+#include <c10/cuda/CUDAGuard.h>
+#include <ATen/cuda/CUDAContext.h>
+#include <cuda_fp16.h>
+#include "../util.h"
+#include "../util.cuh"
+#include "codebook.cuh"
+
+template <int K>
+__global__ __launch_bounds__(128)
+void pack_trellis_kernel
+(
+    uint16_t* __restrict__ g_packed,
+    const uint16_t* __restrict__ g_unpacked
+)
+{
+    constexpr int packed_size = 256 * K / 16;
+    __shared__ uint16_t s_unpacked[256];
+    __shared__ uint16_t s_packed[packed_size];
+
+    int t = threadIdx.x;
+    g_packed += (gridDim.x * blockIdx.y + blockIdx.x) * packed_size;
+    g_unpacked += (gridDim.x * blockIdx.y + blockIdx.x) * 256;
+
+    ((uint32_t*) s_unpacked)[t] = ((uint32_t*) g_unpacked)[t];
+    __syncthreads();
+
+    // 16 spans of 16 weights to guarantee alignment for any K
+    const int spans = 16;
+    const int len = 256 / spans;
+    if (t < spans)
+    {
+        int i = len * t;
+        int j = K * t;
+        int k = 32;
+        uint32_t buf = 0;
+        for (int n = 0; n < len; ++n)
+        {
+            uint32_t v = (uint32_t) s_unpacked[i];
+            v &= ((1 << K) - 1);
+            k -= K;
+            buf |= (v << k);
+            if (k <= 16)
+            {
+                s_packed[j] = (uint16_t) (buf >> 16);
+                buf <<= 16;
+                k += 16;
+                j++;
+            }
+            i++;
+        }
+    }
+    __syncthreads();
+
+    if (t < packed_size / 2)
+        ((uint32_t*) g_packed)[t] = SWAP16(((uint32_t*) s_packed)[t]);;
+}
+
+void pack_trellis
+(
+    at::Tensor packed,
+    at::Tensor unpacked,
+    int K
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(unpacked.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    TORCH_CHECK_SHAPES(packed, 0, unpacked, 0, 1);
+    TORCH_CHECK_SHAPES(packed, 1, unpacked, 1, 1);
+    TORCH_CHECK_SIZE(unpacked, 2, 256);
+    TORCH_CHECK_SIZE(packed, 2, 256 * K / 16);
+
+    int rows = packed.size(0);
+    int cols = packed.size(1);
+
+    dim3 blockDim(128);
+    dim3 gridDim(rows, cols);
+
+    static_for_pack<1, 2, 3, 4, 5, 6, 7, 8>([&](auto ic)
+    {
+        constexpr int i = decltype(ic)::value;
+        if (K == i)
+            pack_trellis_kernel<i><<<gridDim, blockDim, 0, stream>>>
+            (
+                (uint16_t*) packed.data_ptr(),
+                (const uint16_t*) unpacked.data_ptr()
+            );
+    });
+}
+
+template <int K>
+__global__ __launch_bounds__(128)
+void unpack_trellis_kernel
+(
+    uint16_t* __restrict__ g_unpacked,
+    const uint16_t* __restrict__ g_packed
+)
+{
+    constexpr int packed_size = 256 * K / 16;
+    __shared__ uint16_t s_packed[packed_size];
+
+    int t = threadIdx.x;
+    g_packed += (gridDim.x * blockIdx.y + blockIdx.x) * packed_size;
+    g_unpacked += (gridDim.x * blockIdx.y + blockIdx.x) * 256;
+
+    // Read packed tile
+    if (t < packed_size / 2)
+        ((uint32_t*) s_packed)[t] = ((uint32_t*) g_packed)[t];
+    __syncthreads();
+
+    // Index two words
+    int b0 = t * 2 * K + K - 16 + 256 * K;          // start of word0
+    int b1 = b0 + K;                                // start of word1
+    int b2 = b1 + 16;                               // end of word1
+    int i0 = b0 / 32;                               // uint32 containing first bit of word0
+    int i1 = (b2 - 1) / 32;                         // uint32 containing last bit of word1, may be == i0
+    int s1 = (i1 + 1) * 32 - b2;                    // shift to align word1 to 32-bit boundary
+
+    // Load 32-64 bits containing word0 and word1, overlapping by 16-K bits, correct for endianness
+    uint32_t a = ((uint32_t*) s_packed)[i0 % (K * 256 / 32)];
+    uint32_t b = ((uint32_t*) s_packed)[i1 % (K * 256 / 32)];
+//    a = SWAP16(a);
+//    b = SWAP16(b);
+
+    // Shift into place
+    uint32_t w1 = __funnelshift_r(b, a, s1);
+    uint32_t w0 = w1 >> K;
+    w0 &= 0xffff;
+    w1 &= 0xffff;
+
+    // Store
+    uint32_t word01 = (w1 << 16) | w0;
+    ((uint32_t*)g_unpacked)[t] = word01;
+}
+
+void unpack_trellis
+(
+    at::Tensor unpacked,
+    at::Tensor packed,
+    int K
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(unpacked.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    TORCH_CHECK_SHAPES(packed, 0, unpacked, 0, 1);
+    TORCH_CHECK_SHAPES(packed, 1, unpacked, 1, 1);
+    TORCH_CHECK_SIZE(unpacked, 2, 256);
+    TORCH_CHECK_SIZE(packed, 2, 256 * K / 16);
+
+    int rows = packed.size(0);
+    int cols = packed.size(1);
+
+    dim3 blockDim(128);
+    dim3 gridDim(cols, rows);
+
+    static_for_pack<1, 2, 3, 4, 5, 6, 7, 8>([&](auto ic)
+    {
+        constexpr int i = decltype(ic)::value;
+        if (K == i)
+            unpack_trellis_kernel<i><<<gridDim, blockDim, 0, stream>>>
+            (
+                (uint16_t*) unpacked.data_ptr(),
+                (const uint16_t*) packed.data_ptr()
+            );
+    });
+}
+
+__global__ __launch_bounds__(32)
+void pack_signs_kernel
+(
+    uint16_t* __restrict__ g_packed,
+    const uint16_t* __restrict__ g_unpacked,
+    int cols
+)
+{
+    int t = threadIdx.x;
+    int idx = 32 * blockIdx.x + t;
+    if (idx >= cols) return;
+    g_unpacked += 16 * idx;
+    g_packed += idx;
+
+    // Not efficient but whatever
+    uint16_t out = 0;
+    for (int i = 0; i < 16; ++i)
+    {
+        uint16_t v = *g_unpacked++;
+        v &= 0x8000;
+        out >>= 1;
+        out |= v;
+    }
+
+    *g_packed = out;
+}
+
+void pack_signs
+(
+    at::Tensor packed,
+    at::Tensor unpacked
+)
+{
+    const at::cuda::OptionalCUDAGuard device_guard(unpacked.device());
+    cudaStream_t stream = at::cuda::getCurrentCUDAStream().stream();
+
+    TORCH_CHECK_DTYPE(unpacked, kHalf);
+    TORCH_CHECK_DTYPE(packed, kShort);
+
+    int cols = packed.size(0);
+    dim3 blockDim(32);
+    dim3 gridDim(CEIL_DIVIDE(cols, 32));
+
+    pack_signs_kernel<<<gridDim, blockDim, 0, stream>>>
+    (
+        (uint16_t*) packed.data_ptr(),
+        (const uint16_t*) unpacked.data_ptr(),
+        cols
+    );
+}
+
--- a/exllamav3/exllamav3_ext/quant/pack.cuh
+++ b/exllamav3/exllamav3_ext/quant/pack.cuh
@@ -0,0 +1,23 @@
+#pragma once
+
+#include <ATen/Tensor.h>
+
+void pack_trellis
+(
+    at::Tensor packed,
+    at::Tensor unpacked,
+    int K
+);
+
+void unpack_trellis
+(
+    at::Tensor unpacked,
+    at::Tensor packed,
+    int K
+);
+
+void pack_signs
+(
+    at::Tensor packed,
+    at::Tensor unpacked
+);
--- a/Show More
+++ b/Show More