Ternary-Quant

Post-training quantization to {−1, 0, +1} — works on VLMs and seq2seq models, not just LLMs.
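The page doesn't spell out the quantization scheme, so here is a minimal sketch of one common ternarization approach (absmean scaling, as popularized by BitNet b1.58) — an illustration, not necessarily this project's exact method:

```python
import numpy as np

def ternarize(w: np.ndarray) -> tuple[np.ndarray, float]:
    """Quantize a weight tensor to {-1, 0, +1} with one per-tensor scale.

    Absmean scaling: scale = mean(|w|); each weight is divided by the
    scale, rounded, and clipped to the ternary set.
    """
    scale = float(np.abs(w).mean())
    q = np.clip(np.round(w / (scale + 1e-8)), -1, 1).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    """Reconstruct an approximate FP32 tensor from ternary codes."""
    return q.astype(np.float32) * scale

rng = np.random.default_rng(0)
w = rng.normal(0.0, 0.02, size=(4, 8)).astype(np.float32)
q, s = ternarize(w)
w_hat = dequantize(q, s)
```

Small weights round to 0, so the scheme naturally induces sparsity; the per-tensor scale is the only floating-point state kept per layer in this sketch.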

Select a model, enter a prompt, and click Generate. The first call downloads and loads the checkpoint (a few seconds); subsequent calls are fast.

Controls: model selector; max new tokens (32–512); temperature (0–1.5, 0 = greedy).

Compression stats

Model: AsadIsmail/Qwen3-1.7B-ternary
FP16 size: 2.8 GB
Ternary size: 1.03 GB
Compression: 2.7×
Quality retained: 97.5 %
Bit-width: ~4 bits effective
GPU speed: 28.8 tok/s
CPU speed: 11.3 tok/s
Runtime mode: cached
Device: CPU

Qwen3-1.7B quantized to ternary weights {−1, 0, +1}. Retains 97.5 % of FP16 quality across standard benchmarks.
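Ternary weights need only 2 bits each, so four fit in a byte. A sketch of such packing (hypothetical helpers, not necessarily the demo's on-disk format; the table's ~4 bits effective would then be consistent with some tensors, such as embeddings, staying in higher precision):

```python
import numpy as np

def pack_ternary(q: np.ndarray) -> np.ndarray:
    """Pack ternary codes {-1, 0, +1} into bytes, 4 weights per byte."""
    codes = (q.astype(np.int8) + 1).astype(np.uint8).ravel()  # map to {0, 1, 2}
    pad = (-len(codes)) % 4                                   # pad to a multiple of 4
    codes = np.pad(codes, (0, pad))
    b = codes.reshape(-1, 4)
    return (b[:, 0] | (b[:, 1] << 2) | (b[:, 2] << 4) | (b[:, 3] << 6)).astype(np.uint8)

def unpack_ternary(packed: np.ndarray, n: int) -> np.ndarray:
    """Recover the first n ternary codes from a packed byte array."""
    codes = np.stack([(packed >> s) & 3 for s in (0, 2, 4, 6)], axis=1).ravel()[:n]
    return codes.astype(np.int8) - 1

q = np.array([-1, 0, 1, 1, -1], dtype=np.int8)
packed = pack_ternary(q)          # 5 weights -> 2 bytes
restored = unpack_ternary(packed, 5)
```

At 2 bits/weight this is an 8× reduction versus FP16 for the packed tensors; the overall 2.7× figure above reflects the whole checkpoint, not just the packed layers.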
