🚀 MoQ: Mixture of Quants

MoQ (Mixture of Quants) is a smart way to shrink AI models without losing their "brainpower." Unlike old methods that treat every part of the model the same, MoQ identifies the most important parts and keeps them high-quality, while heavily compressing the rest to save space.

**Stop settling for uniform bitrates. Standard quantization is a relic of the past, treating vital cognitive weights the same as redundant noise.**


The result? A model that punches significantly above its weight class.

Benjamin Marie evaluated MoQ GGUFs ("Mixture of Quants") against Unsloth Dynamic (UD) quants, focusing on low-bit versions below 4 bits on average — the range where GGUF models typically struggle most. Results: At similar bits-per-weight (Bpw), MoQ outperforms Unsloth Dynamic quants by ~10% on benchmarks, while also being roughly 2× more token-efficient on average.

"MoQ models are much better than UD quants on benchmarks, and they are also more token-efficient."

| Folder Link | BPW | Total Size | Description |
|---|---|---|---|
| 📂 BF16 | 16 | 17.92 GB | |
| 📂 F16 | 16 | 17.92 GB | |
| 📂 MoQ-Quants-Latest | 3.2 | 3.58 GB | |
| 📂 MoQ-Quants-Latest | 3.6 | 4.03 GB | |
| 📂 MoQ-Quants-Latest | 3.8 | 4.22 GB | |
| 📂 MoQ-Quants-Latest | 4.1 | 4.64 GB | |
| 📂 MoQ-Quants-Latest | 4.3 | 4.84 GB | |
| 📂 MoQ-Quants-Latest | 4.6 | 5.11 GB | |
| 📂 MoQ-Quants-Latest | 4.8 | 5.37 GB | |
| 📂 MoQ-Quants-Latest | 4.9 | 5.49 GB | |
| 📂 MoQ-Quants-Latest | 5.1 | 5.75 GB | |
| 📂 MoQ-Quants-Latest | 5.3 | 5.92 GB | |
| 📂 MoQ-Quants-Latest | 6.6 | 7.36 GB | |
| 📂 MoQ-mmproj | 16.0 | 0.92 GB | |
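
The sizes listed here track BPW closely: file size is roughly the parameter count times bits per weight, divided by 8. As a rough sanity check, the sketch below infers the parameter count from the 16-bit row and estimates the size of the 3.2 BPW file. The function names are illustrative, GB is assumed to mean 10^9 bytes, and the estimate ignores file metadata and rounding in the listed BPW values, so small deviations from the listed sizes are expected.

```python
def params_from_size(size_gb: float, bpw: float) -> float:
    """Infer parameter count from a file size in GB (10^9 bytes) and its bits per weight."""
    return size_gb * 1e9 * 8 / bpw


def estimate_size_gb(n_params: float, bpw: float) -> float:
    """Estimate file size in GB for a parameter count and an average bits per weight."""
    return n_params * bpw / 8 / 1e9


# 16-bit row: 17.92 GB at 16 BPW.
n_params = params_from_size(17.92, 16)

# Estimated size of the 3.2 BPW file (listed as 3.58 GB).
size_3_2 = estimate_size_gb(n_params, 3.2)
print(f"{n_params / 1e9:.2f}B params, about {size_3_2:.2f} GB at 3.2 BPW")
```

The same two functions can be used in reverse to check what average BPW a downloaded file actually has.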

🧠 The MoQ Edge

MoQ optimizes the architecture for the Pareto frontier of memory and performance.

  • Dynamic Bitrate Allocation: No more "one-size-fits-all." MoQ assigns precision where it actually matters.
  • Cognitive Preservation: Massive VRAM savings with near-zero degradation in logic and coherence.
  • Next-Gen Efficiency: Fits "Large" model intelligence into "Small" model hardware.
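
MoQ's actual allocation procedure is not described in this card, so the following is only a toy illustration of the general idea behind mixed-precision allocation: start every tensor at the lowest bit width, then upgrade the most sensitive tensors first until an average-BPW budget is used up. The tensor names, sizes, importance scores, and available bit levels are all made up for the example.

```python
from dataclasses import dataclass


@dataclass
class Tensor:
    name: str          # hypothetical tensor name
    n_params: int      # number of weights in the tensor
    importance: float  # hypothetical sensitivity score (higher = more sensitive)


def allocate_bits(tensors, target_bpw, levels=(2, 3, 4, 6, 8)):
    """Greedily assign a bit width per tensor so the average stays within target_bpw."""
    total_params = sum(t.n_params for t in tensors)
    budget = target_bpw * total_params            # total bits available
    bits = {t.name: levels[0] for t in tensors}   # start everything at the lowest level
    used = levels[0] * total_params
    if used > budget:
        raise ValueError("target_bpw is below the lowest available bit width")

    # Upgrade the most important tensors first, one level at a time.
    for t in sorted(tensors, key=lambda t: t.importance, reverse=True):
        for level in levels[1:]:
            extra = (level - bits[t.name]) * t.n_params
            if used + extra > budget:
                break
            bits[t.name] = level
            used += extra
    return bits, used / total_params


tensors = [
    Tensor("attn.q", 100, 0.9),
    Tensor("ffn.up", 300, 0.2),
    Tensor("embed", 100, 0.6),
]
bits, avg_bpw = allocate_bits(tensors, target_bpw=4.0)
print(bits, avg_bpw)
```

In this toy run the small, sensitive tensors end up at high precision while the large, less sensitive one stays at the lowest level, which is the trade-off the bullets above describe. A real scheme would measure sensitivity on calibration data rather than take it as given.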

Comparison

Here is the comparison between MoQ and Unsloth Dynamic quants, performed on wiki text. MoQ appears to perform better. (Benchmaxxing is not allowed!)

X: https://x.com/WaleedAhmad1a10

If MoQ does not perform well, email me: waleedahmad.1a10@gmail.com

🛠 Usage & Deployment

./llama-cli -m Qwen3.5-9B-MoQ-4.85.gguf -p "The future of efficient AI is..."
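
Which file to pass to `-m` depends on how much memory is available. As a small sketch, the helper below picks the largest listed MoQ variant whose file fits a given memory budget, using the BPW and size values from the table. The `headroom_gb` default is an assumed placeholder for KV cache and runtime overhead, not a measured figure, and `pick_variant` is a hypothetical helper name.

```python
# (BPW, file size in GB) for the MoQ variants listed in the table.
MOQ_VARIANTS = [
    (3.2, 3.58), (3.6, 4.03), (3.8, 4.22), (4.1, 4.64), (4.3, 4.84),
    (4.6, 5.11), (4.8, 5.37), (4.9, 5.49), (5.1, 5.75), (5.3, 5.92),
    (6.6, 7.36),
]


def pick_variant(memory_gb, headroom_gb=1.5):
    """Return the largest (bpw, size_gb) that fits in memory_gb minus headroom, or None."""
    usable = memory_gb - headroom_gb
    fitting = [v for v in MOQ_VARIANTS if v[1] <= usable]
    return max(fitting, key=lambda v: v[1]) if fitting else None


choice_8gb = pick_variant(8.0)  # 6.5 GB usable after headroom
choice_4gb = pick_variant(4.0)  # 2.5 GB usable: no listed variant fits
print(choice_8gb, choice_4gb)
```

Longer contexts need more headroom, so treat the result as a starting point and adjust `headroom_gb` for your own setup.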

Format: GGUF
Model size: 9B params
Architecture: qwen35
Model tree for w-ahmad/Qwen3.5-9B-GGUF-MoQ

  • Finetuned: Qwen/Qwen3.5-9B
  • Quantized (259): this model