In this post, I’ll break down PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models, a quantization method that decomposes weight matrices into ternary trit-planes to enable multiplication-free inference — rivaling quantization-aware training quality at a fraction of the cost.
the problem
Neural networks store knowledge in weights. Floating point numbers. A 7B parameter model at FP16 precision is 7 billion weights times 16 bits each. That’s 14 GB just sitting in memory doing nothing until you run inference.
Inference means matrix multiplications. Lots of them. Every single weight gets multiplied by an activation value. Multiply, accumulate, repeat billions of times. That’s what makes LLMs slow and expensive.
Quantization is the idea of using fewer bits per weight. Go from 16-bit floats to 4-bit integers and your model is 4x smaller. But push below 2 bits and things start to break. The weights lose too much information and the model falls apart.
PTQTP pushes to the extreme: each weight becomes a combination of just three values. Negative one, zero, or one. And it works.
trits, not bits
A bit has two states: 0 or 1. With 1 bit you can represent 2 things. With 2 bits you can represent 4 things (00, 01, 10, 11). In general, the number of things you can represent with $n$ bits grows exponentially: $2^n$.
So how many bits do you need to represent 3 things? You’re solving $2^n = 3$. Take the log of both sides: $n = \log_2 3 \approx 1.585$. Not a whole number. You can’t actually use 1.585 physical bits — in hardware you’d round up to 2. But information-theoretically, a value that can be one of three things carries 1.58 bits of information. That’s where the “1.58-bit” precision comes from.
Those three things, in PTQTP, are $\{-1, 0, +1\}$. That’s a trit.
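(A quick sanity check on that number, in Python:)

```python
import math
print(math.log2(3))  # 1.5849625007211562 -> the "1.58 bits" carried by one trit
```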
During inference, every weight gets multiplied by an input value. When weights are normal floats, that’s a real multiplication. But when the weight can only be one of $\{-1, 0, +1\}$, watch what happens:
1 × 3.7 = 3.7 (just use 3.7 as-is, no math needed)
-1 × 3.7 = -3.7 (just flip the sign of 3.7)
0 × 3.7 = 0.0 (just skip it entirely)
None of these are actual multiplications. The first is a copy. The second is a sign flip — toggling one bit in the floating point representation. The third is nothing at all. The multiply instruction never fires.
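To make the three cases concrete, here’s a tiny Python sketch (my own illustration, not code from the paper):

```python
def trit_times(trit: int, x: float) -> float:
    """'Multiply' x by a trit without ever issuing a multiply."""
    if trit == 0:
        return 0.0   # skip entirely
    if trit == 1:
        return x     # copy as-is
    return -x        # trit == -1: just flip the sign

print(trit_times(1, 3.7), trit_times(-1, 3.7), trit_times(0, 3.7))  # 3.7 -3.7 0.0
```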
Why does this matter? Because multiplying two floating point numbers is expensive. The chip has to decompose both numbers, multiply the mantissas, add the exponents, normalize the result, handle edge cases. A floating point multiplier is a big circuit — it takes many clock cycles, burns energy, and eats die area.
An addition is way cheaper. The circuit is smaller, it finishes faster, it uses less power. A sign flip is even cheaper — literally toggling one bit. And skipping a zero costs nothing at all.
Now think about scale. A single layer in an LLM with $4096 \times 4096$ weights does ~16 million multiplications. Per layer. Per token. Every one of them goes through that expensive multiplier circuit. Replace all of them with additions and sign flips and the chip spends almost no time on math. The bottleneck shifts to just loading the weights from memory fast enough.
the decomposition
Here’s the core idea. Take a weight matrix $W$ and approximate it as:

$$W \approx \alpha^{(1)} \odot T^{(1)} + \alpha^{(2)} \odot T^{(2)}$$

For a weight matrix of size $m \times n$:
- $T^{(1)}$ and $T^{(2)}$ are $m \times n$ matrices filled with only $-1$, $0$, or $+1$ — the “trit-planes”
- $\alpha^{(1)}$ and $\alpha^{(2)}$ are vectors of length $m$ — one scale factor per row
The $\odot$ is row-wise scaling, not matrix multiplication. Each $\alpha$ is a column vector and each $T$ is a matrix. $\alpha_i$ multiplies every element in row $i$ of $T$. Each row of the trit-plane gets scaled by its own scalar.
Two trit-planes. The first one captures the rough shape of the weights. The second one fixes the details. Think of it like a coarse pass and a fine pass.
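In NumPy, the row-wise scaling is just broadcasting a column of per-row scales against each trit-plane. A minimal sketch (shapes and variable names are mine, not the paper’s):

```python
import numpy as np

m, n = 4, 8                                      # toy dimensions
T1 = np.random.choice([-1, 0, 1], size=(m, n))   # first trit-plane
T2 = np.random.choice([-1, 0, 1], size=(m, n))   # second trit-plane
a1 = np.random.rand(m)                           # one scale per row for plane 1
a2 = np.random.rand(m)                           # one scale per row for plane 2

# Row-wise scaling: a1[i] multiplies every element of row i of T1, same for a2 and T2.
W_hat = a1[:, None] * T1 + a2[:, None] * T2
print(W_hat.shape)                               # (4, 8)
```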
Here’s what that looks like with numbers. Say we have a weight matrix:

$$W = \begin{pmatrix} 0.5 & -0.5 & 0.3 & -0.3 \\ 0.8 & 0.2 & -0.8 & -0.2 \end{pmatrix}$$

Its decomposition is:

$$T^{(1)} = \begin{pmatrix} 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \end{pmatrix}, \quad T^{(2)} = \begin{pmatrix} 1 & -1 & -1 & 1 \\ 1 & -1 & -1 & 1 \end{pmatrix}, \quad \alpha^{(1)} = \begin{pmatrix} 0.4 \\ 0.5 \end{pmatrix}, \quad \alpha^{(2)} = \begin{pmatrix} 0.1 \\ 0.3 \end{pmatrix}$$

Row-wise scaling means $\alpha^{(1)}_1 = 0.4$ multiplies every element in row 1 of $T^{(1)}$, and $\alpha^{(1)}_2 = 0.5$ multiplies every element in row 2. Written out element by element:

$$\hat{W}_{ij} = \alpha^{(1)}_i T^{(1)}_{ij} + \alpha^{(2)}_i T^{(2)}_{ij}, \qquad \text{e.g. } \hat{W}_{11} = 0.4 \times 1 + 0.1 \times 1 = 0.5$$
Two trits per weight, plus two scale factors per row. That’s it.
let’s compute the decomposition
Take this small weight matrix:

$$W = \begin{pmatrix} 0.5 & -0.5 & 0.3 & -0.3 \\ 0.8 & 0.2 & -0.8 & -0.2 \end{pmatrix}$$

We need to find $T^{(1)}$, $T^{(2)}$, $\alpha^{(1)}$, and $\alpha^{(2)}$ such that $W \approx \alpha^{(1)} \odot T^{(1)} + \alpha^{(2)} \odot T^{(2)}$. Four unknowns. We can’t solve for all of them at once. So we break it into two smaller problems.
Problem A: If the trits are fixed, finding the best scales is a regression problem. For each row $i$, we want $\alpha^{(1)}_i$ and $\alpha^{(2)}_i$ that minimize the squared error:

$$\sum_j \left(W_{ij} - \alpha^{(1)}_i T^{(1)}_{ij} - \alpha^{(2)}_i T^{(2)}_{ij}\right)^2$$

For our example, that’s (row 1):

$$\left(0.5 - \alpha^{(1)}_1 T^{(1)}_{11} - \alpha^{(2)}_1 T^{(2)}_{11}\right)^2 + \left(-0.5 - \alpha^{(1)}_1 T^{(1)}_{12} - \alpha^{(2)}_1 T^{(2)}_{12}\right)^2 + \left(0.3 - \alpha^{(1)}_1 T^{(1)}_{13} - \alpha^{(2)}_1 T^{(2)}_{13}\right)^2 + \left(-0.3 - \alpha^{(1)}_1 T^{(1)}_{14} - \alpha^{(2)}_1 T^{(2)}_{14}\right)^2$$

The trits are known, the weights are known. The only unknowns are $\alpha^{(1)}_1$ and $\alpha^{(2)}_1$. Each row gives 4 equations with 2 unknowns, e.g. for row 1:

$$\alpha^{(1)}_1 T^{(1)}_{1j} + \alpha^{(2)}_1 T^{(2)}_{1j} \approx W_{1j}, \qquad j = 1, 2, 3, 4$$
More equations than unknowns means there’s no exact solution in general, but there’s a least-squares best one.
Problem B: If the scales are fixed, finding the best trits is even simpler. Each trit can only be $-1$, $0$, or $+1$. Two trits per entry means $3 \times 3 = 9$ possible combinations. Just try all nine, compute the error for each one, pick the smallest.
For example, say we know $\alpha^{(1)} = 0.4$ and $\alpha^{(2)} = 0.1$. What trits best reconstruct a weight of $0.5$?
(T⁽¹⁾, T⁽²⁾) = (-1,-1) → 0.4×(-1) + 0.1×(-1) = -0.50 error = 1.000
(T⁽¹⁾, T⁽²⁾) = (-1, 0) → 0.4×(-1) + 0.1×( 0) = -0.40 error = 0.810
(T⁽¹⁾, T⁽²⁾) = (-1, 1) → 0.4×(-1) + 0.1×( 1) = -0.30 error = 0.640
(T⁽¹⁾, T⁽²⁾) = ( 0,-1) → 0.4×( 0) + 0.1×(-1) = -0.10 error = 0.360
(T⁽¹⁾, T⁽²⁾) = ( 0, 0) → 0.4×( 0) + 0.1×( 0) = 0.00 error = 0.250
(T⁽¹⁾, T⁽²⁾) = ( 0, 1) → 0.4×( 0) + 0.1×( 1) = 0.10 error = 0.160
(T⁽¹⁾, T⁽²⁾) = ( 1,-1) → 0.4×( 1) + 0.1×(-1) = 0.30 error = 0.040
(T⁽¹⁾, T⁽²⁾) = ( 1, 0) → 0.4×( 1) + 0.1×( 0) = 0.40 error = 0.010
(T⁽¹⁾, T⁽²⁾) = ( 1, 1) → 0.4×( 1) + 0.1×( 1) = 0.50 error = 0.000 ← winner
$(T^{(1)}, T^{(2)}) = (1, 1)$ gives zero error. Repeat for every entry in the matrix, and you have the full trit-planes. Brute force. No cleverness needed.
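Here’s what that brute force looks like in Python, as a minimal helper of my own (it assumes the two scales are already known):

```python
from itertools import product

def best_trit_pair(w: float, a1: float, a2: float) -> tuple[int, int]:
    """Try all 9 (t1, t2) combinations and keep the one with the smallest squared error."""
    best_err, best_pair = float("inf"), (0, 0)
    for t1, t2 in product([-1, 0, 1], repeat=2):
        err = (w - (a1 * t1 + a2 * t2)) ** 2
        if err < best_err:
            best_err, best_pair = err, (t1, t2)
    return best_pair

print(best_trit_pair(0.5, 0.4, 0.1))  # (1, 1), matching the table above
```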
Neither problem is useful alone. A needs trits to compute the scales. B needs scales to evaluate which trit combination is best — look at the table above: every error computation uses $\alpha^{(1)} = 0.4$ and $\alpha^{(2)} = 0.1$. Without those numbers, there’s nothing to brute force over.
But if we start with a rough guess and keep alternating — solve A, then B, then A, then B — each step reduces the error and the whole thing converges.
step 1: initialize
Start simple. Set both trit-planes to the sign of each weight. Positive weight becomes $+1$, negative becomes $-1$:

$$T^{(1)} = T^{(2)} = \mathrm{sign}(W) = \begin{pmatrix} 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \end{pmatrix}$$

Both planes are identical. Not great. But it’s a starting point.
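In code, the initialization is a single sign call. A quick sketch using the example matrix:

```python
import numpy as np

W = np.array([[0.5, -0.5, 0.3, -0.3],
              [0.8,  0.2, -0.8, -0.2]])
T1 = np.sign(W).astype(int)   # step 1: every weight becomes +1 or -1
T2 = T1.copy()                # both planes start out identical
print(T1)
# [[ 1 -1  1 -1]
#  [ 1  1 -1 -1]]
```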
step 2: find the best scales (given the trits)
Now we have trits and need to find the scales that minimize reconstruction error — this is Problem A. For each row $i$, we solve:

$$\min_{\alpha^{(1)}_i,\ \alpha^{(2)}_i} \sum_j \left(W_{ij} - \alpha^{(1)}_i T^{(1)}_{ij} - \alpha^{(2)}_i T^{(2)}_{ij}\right)^2$$

This is exactly the same as ordinary least squares regression. In classic linear regression you have data points $y$, features $X$, and you want coefficients $\beta$ that minimize $\|y - X\beta\|^2$. The solution is $\beta = (X^\top X)^{-1} X^\top y$.
Here the mapping is:
- $y$ → the weights of row $i$ (what we’re trying to approximate)
- $X$ → the trit values from row $i$ of $T^{(1)}$ and $T^{(2)}$, stacked as columns
- $\beta$ → the scales $\alpha^{(1)}_i$ and $\alpha^{(2)}_i$ (the coefficients we’re solving for)
So the solution is $\alpha = (X^\top X + \epsilon I)^{-1} X^\top y$. The $\epsilon$ adds a tiny number to the diagonal of $X^\top X$ before inverting. Why? Because inverting requires dividing by the determinant of $X^\top X$. If the trit vectors happen to be nearly identical, the determinant gets close to zero and the division blows up. Adding $\epsilon$ to the diagonal nudges it away from zero. In our example the determinant is comfortably far from zero, so $\epsilon$ barely matters — but it’s a safety net.
Since we only have 2 coefficients, $X^\top X$ is always $2 \times 2$. And a $2 \times 2$ matrix has a simple closed-form inverse:

$$\begin{pmatrix} a & b \\ c & d \end{pmatrix}^{-1} = \frac{1}{ad - bc} \begin{pmatrix} d & -b \\ -c & a \end{pmatrix}$$
Swap the diagonals, negate the off-diagonals, divide by the determinant. No iterative solver needed.
For row 1, after the trit-planes have converged, $X$ stacks row 1 of $T^{(1)}$ and $T^{(2)}$ as columns:

$$X = \begin{pmatrix} 1 & 1 \\ -1 & -1 \\ 1 & -1 \\ -1 & 1 \end{pmatrix}, \qquad y = \begin{pmatrix} 0.5 \\ -0.5 \\ 0.3 \\ -0.3 \end{pmatrix}$$

Plugging into the formula (ignoring the tiny $\epsilon$):

$$X^\top X = \begin{pmatrix} 4 & 0 \\ 0 & 4 \end{pmatrix}, \qquad X^\top y = \begin{pmatrix} 1.6 \\ 0.4 \end{pmatrix}, \qquad \alpha = (X^\top X)^{-1} X^\top y = \begin{pmatrix} 0.4 \\ 0.1 \end{pmatrix}$$

$\alpha^{(1)}_1 = 0.4$, $\alpha^{(2)}_1 = 0.1$. The scales fall right out.
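The same row-1 solve as a short NumPy sketch (using the converged trits from the walkthrough; the ridge value 1e-6 is an assumption of mine):

```python
import numpy as np

w  = np.array([0.5, -0.5, 0.3, -0.3])   # row 1 of W
t1 = np.array([1, -1,  1, -1])          # row 1 of T1
t2 = np.array([1, -1, -1,  1])          # row 1 of T2
eps = 1e-6                              # assumed ridge value

X = np.stack([t1, t2], axis=1)          # 4x2 matrix of trit columns
# Regularized least squares: (X^T X + eps*I)^(-1) X^T w
alpha = np.linalg.solve(X.T @ X + eps * np.eye(2), X.T @ w)
print(alpha)                            # ~[0.4, 0.1]
```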
step 3: find the best trits (given the scales)
Now we have scales and need to find the best trit combination for each weight entry. Each entry has two trits ($T^{(1)}_{ij}$ and $T^{(2)}_{ij}$), each of which can be $-1$, $0$, or $+1$. That’s $3 \times 3 = 9$ possible combinations. Just try all nine and pick the one with the smallest error.
Let’s do it for the first entry: $W_{11} = 0.5$, with $\alpha^{(1)}_1 = 0.4$, $\alpha^{(2)}_1 = 0.1$.
(T⁽¹⁾, T⁽²⁾) = (-1,-1) → 0.4×(-1) + 0.1×(-1) = -0.50 error = (0.5-(-0.50))² = 1.000
(T⁽¹⁾, T⁽²⁾) = (-1, 0) → 0.4×(-1) + 0.1×( 0) = -0.40 error = (0.5-(-0.40))² = 0.810
(T⁽¹⁾, T⁽²⁾) = (-1, 1) → 0.4×(-1) + 0.1×( 1) = -0.30 error = (0.5-(-0.30))² = 0.640
(T⁽¹⁾, T⁽²⁾) = ( 0,-1) → 0.4×( 0) + 0.1×(-1) = -0.10 error = (0.5-(-0.10))² = 0.360
(T⁽¹⁾, T⁽²⁾) = ( 0, 0) → 0.4×( 0) + 0.1×( 0) = 0.00 error = (0.5-( 0.00))² = 0.250
(T⁽¹⁾, T⁽²⁾) = ( 0, 1) → 0.4×( 0) + 0.1×( 1) = 0.10 error = (0.5-( 0.10))² = 0.160
(T⁽¹⁾, T⁽²⁾) = ( 1,-1) → 0.4×( 1) + 0.1×(-1) = 0.30 error = (0.5-( 0.30))² = 0.040
(T⁽¹⁾, T⁽²⁾) = ( 1, 0) → 0.4×( 1) + 0.1×( 0) = 0.40 error = (0.5-( 0.40))² = 0.010
(T⁽¹⁾, T⁽²⁾) = ( 1, 1) → 0.4×( 1) + 0.1×( 1) = 0.50 error = (0.5-( 0.50))² = 0.000 ← winner
The winner is $(T^{(1)}_{11}, T^{(2)}_{11}) = (1, 1)$ with zero error. The reconstruction hits $0.5$ exactly.
Now for the third entry: $W_{13} = 0.3$.
(T⁽¹⁾, T⁽²⁾) = (-1,-1) → -0.50 error = 0.640
(T⁽¹⁾, T⁽²⁾) = (-1, 0) → -0.40 error = 0.490
(T⁽¹⁾, T⁽²⁾) = (-1, 1) → -0.30 error = 0.360
(T⁽¹⁾, T⁽²⁾) = ( 0,-1) → -0.10 error = 0.160
(T⁽¹⁾, T⁽²⁾) = ( 0, 0) → 0.00 error = 0.090
(T⁽¹⁾, T⁽²⁾) = ( 0, 1) → 0.10 error = 0.040
(T⁽¹⁾, T⁽²⁾) = ( 1,-1) → 0.30 error = 0.000 ← winner
(T⁽¹⁾, T⁽²⁾) = ( 1, 0) → 0.40 error = 0.010
(T⁽¹⁾, T⁽²⁾) = ( 1, 1) → 0.50 error = 0.040
The winner is $(T^{(1)}_{13}, T^{(2)}_{13}) = (1, -1)$. The first plane adds $0.4$, the second subtracts $0.1$, giving exactly $0.3$. This is why two planes matter — the same scales can reconstruct both $0.5$ and $0.3$ by choosing different trit combinations.
Same process for the remaining entries.
convergence
We ran step 2 (find scales given trits) and step 3 (find trits given scales). Both steps agree. The scales we computed are optimal for these trits, and the trits we computed are optimal for these scales. Nothing changes if we iterate again. The algorithm has converged.
Our tiny example converged in one round because the matrix was small and the decomposition happened to be exact. In practice, with thousands of weights, the reconstruction error won’t reach zero — the algorithm iterates until the error drops below a small tolerance, typically around 30 iterations, capped at 50.
Applying the same process to row 2 (I’ll spare you the nine-combo tables), we get:

$$\alpha^{(1)}_2 = 0.5, \quad \alpha^{(2)}_2 = 0.3, \quad T^{(1)}_{2,:} = (1, 1, -1, -1), \quad T^{(2)}_{2,:} = (1, -1, -1, 1)$$
Let’s verify by reconstructing every entry.
Row 1: $\alpha^{(1)}_1 = 0.4$, $\alpha^{(2)}_1 = 0.1$
Ŵ₁₁ = 0.4 × ( 1) + 0.1 × ( 1) = 0.4 + 0.1 = 0.5 ✓
Ŵ₁₂ = 0.4 × (-1) + 0.1 × (-1) = -0.4 - 0.1 = -0.5 ✓
Ŵ₁₃ = 0.4 × ( 1) + 0.1 × (-1) = 0.4 - 0.1 = 0.3 ✓
Ŵ₁₄ = 0.4 × (-1) + 0.1 × ( 1) = -0.4 + 0.1 = -0.3 ✓
Row 2: $\alpha^{(1)}_2 = 0.5$, $\alpha^{(2)}_2 = 0.3$
Ŵ₂₁ = 0.5 × ( 1) + 0.3 × ( 1) = 0.5 + 0.3 = 0.8 ✓
Ŵ₂₂ = 0.5 × ( 1) + 0.3 × (-1) = 0.5 - 0.3 = 0.2 ✓
Ŵ₂₃ = 0.5 × (-1) + 0.3 × (-1) = -0.5 - 0.3 = -0.8 ✓
Ŵ₂₄ = 0.5 × (-1) + 0.3 × ( 1) = -0.5 + 0.3 = -0.2 ✓
Perfect reconstruction. Every entry matches.
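To tie the three steps together, here’s a rough end-to-end sketch of the alternation in NumPy. This is my reconstruction from the description above, not the authors’ released code; the variable names, the $\epsilon$ value, and the exact stopping rule are assumptions, and a naive alternation like this can settle into a local optimum rather than the exact decomposition we found by hand.

```python
import numpy as np
from itertools import product

def ptqtp_decompose(W, max_iters=50, tol=1e-6, eps=1e-6):
    """Alternate between Problem A (scales via least squares) and Problem B (trits via brute force)."""
    m, n = W.shape
    T1 = np.sign(W).astype(int)              # step 1: initialize both planes to sign(W)
    T2 = T1.copy()
    a1 = np.zeros(m)
    a2 = np.zeros(m)
    combos = list(product([-1, 0, 1], repeat=2))

    prev_err = np.inf
    for _ in range(max_iters):
        # Problem A: best scales per row, given the current trits
        for i in range(m):
            X = np.stack([T1[i], T2[i]], axis=1).astype(float)
            a1[i], a2[i] = np.linalg.solve(X.T @ X + eps * np.eye(2), X.T @ W[i])
        # Problem B: best trit pair per entry, given the current scales
        for i in range(m):
            for j in range(n):
                errs = [(W[i, j] - (a1[i] * t1 + a2[i] * t2)) ** 2 for t1, t2 in combos]
                T1[i, j], T2[i, j] = combos[int(np.argmin(errs))]
        # Stop when the reconstruction error stops improving
        err = np.mean((W - (a1[:, None] * T1 + a2[:, None] * T2)) ** 2)
        if prev_err - err < tol:
            break
        prev_err = err
    return T1, T2, a1, a2

W = np.array([[0.5, -0.5, 0.3, -0.3],
              [0.8,  0.2, -0.8, -0.2]])
T1, T2, a1, a2 = ptqtp_decompose(W)
print(np.mean((W - (a1[:, None] * T1 + a2[:, None] * T2)) ** 2))  # small, though not necessarily zero
```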
why not just one plane?
You might wonder: why two trit-planes? Why not just one?
Let’s try. With a single trit-plane, the best we can do for row 1 is:

$$\alpha_1 = 0.4, \qquad T_{1,:} = (1, -1, 1, -1)$$
Reconstruction:
Ŵ₁₁ = 0.4 × ( 1) = 0.4 (want 0.5, error = 0.1)
Ŵ₁₂ = 0.4 × (-1) = -0.4 (want -0.5, error = 0.1)
Ŵ₁₃ = 0.4 × ( 1) = 0.4 (want 0.3, error = 0.1)
Ŵ₁₄ = 0.4 × (-1) = -0.4 (want -0.3, error = 0.1)
One scale factor can’t capture two different magnitudes ($0.5$ and $0.3$). The second trit-plane fixes this. It adds $0.1$ where the weight is bigger and subtracts $0.1$ where it’s smaller. Coarse + fine = exact.
multiply without multiplying
Here’s where it gets good. Let’s do an actual forward pass.
Input vector: $x = (1.0,\ 2.0,\ -1.0,\ 0.5)$
The normal way (FP16)
Multiply every weight by the corresponding input, then sum:
y₁ = 0.5 × 1.0 + (-0.5) × 2.0 + 0.3 × (-1.0) + (-0.3) × 0.5
= 0.5 - 1.0 - 0.3 - 0.15
= -0.95
That’s 4 multiplications and 3 additions.
The PTQTP way
First, compute $T^{(1)} \cdot x$ for row 1. Each trit just tells us whether to add, subtract, or skip:
T⁽¹⁾₁₁ =  1 → keep x₁   → +1.0
T⁽¹⁾₁₂ = -1 → negate x₂ → -2.0
T⁽¹⁾₁₃ =  1 → keep x₃   → -1.0
T⁽¹⁾₁₄ = -1 → negate x₄ → -0.5
sum = -2.5
Then $T^{(2)} \cdot x$ for row 1:
T⁽²⁾₁₁ =  1 → keep x₁   → +1.0
T⁽²⁾₁₂ = -1 → negate x₂ → -2.0
T⁽²⁾₁₃ = -1 → negate x₃ → +1.0
T⁽²⁾₁₄ =  1 → keep x₄   → +0.5
sum = 0.5
No multiplications so far — just additions and sign flips. Now we need to multiply by the scale factors:

$$y_1 = \alpha^{(1)}_1 \times (-2.5) + \alpha^{(2)}_1 \times 0.5 = 0.4 \times (-2.5) + 0.1 \times 0.5 = -1.0 + 0.05 = -0.95$$

Same answer. But count the operations: the trit-vector dot products used zero multiplications — just additions and subtractions. The only real multiplications are the two scale factors at the end. Normal inference does $n$ multiplications per row — one per weight. PTQTP does 2 multiplications per row — one per trit-plane. The rest are additions. Multiplications go from $m \times n$ to $2m$ over the whole matrix, where $m$ and $n$ can be in the thousands.
Let’s do row 2 to be thorough:
T⁽¹⁾ · x: (+1.0) + (+2.0) + (+1.0) + (-0.5) = 3.5
T⁽²⁾ · x: (+1.0) + (-2.0) + (+1.0) + (+0.5) = 0.5
Scale and sum: $y_2 = 0.5 \times 3.5 + 0.3 \times 0.5 = 1.75 + 0.15 = 1.9$. Verify with FP16:
y₂ = 0.8 × 1.0 + 0.2 × 2.0 + (-0.8) × (-1.0) + (-0.2) × 0.5 = 0.8 + 0.4 + 0.8 - 0.1 = 1.9 ✓
For a row with 4096 entries, normal inference does 4096 multiplications. PTQTP does 2 multiplications (the scales) plus two passes of 4096 additions, minus whatever the zero trits let it skip. That’s why it’s 4.63x faster.
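And here’s the forward pass as a sketch in plain Python (illustrative only: a real kernel would pack the trits into bit-planes and vectorize this, but the structure is the same: only adds, sign flips, and two scale multiplies per row).

```python
import numpy as np

def ptqtp_matvec(T1, T2, a1, a2, x):
    """y[i] = a1[i]*(T1[i] . x) + a2[i]*(T2[i] . x), with no multiplies inside the dot products."""
    m, n = T1.shape
    y = np.empty(m)
    for i in range(m):
        s1 = s2 = 0.0
        for j in range(n):
            if T1[i, j] == 1:
                s1 += x[j]       # copy
            elif T1[i, j] == -1:
                s1 -= x[j]       # sign flip; zero trits are skipped entirely
            if T2[i, j] == 1:
                s2 += x[j]
            elif T2[i, j] == -1:
                s2 -= x[j]
        y[i] = a1[i] * s1 + a2[i] * s2   # the only two real multiplies in the row
    return y

T1 = np.array([[1, -1,  1, -1], [1,  1, -1, -1]])
T2 = np.array([[1, -1, -1,  1], [1, -1, -1,  1]])
a1, a2 = np.array([0.4, 0.5]), np.array([0.1, 0.3])
x = np.array([1.0, 2.0, -1.0, 0.5])
print(ptqtp_matvec(T1, T2, a1, a2, x))   # [-0.95  1.9 ]
```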
the zero trit: free denoising
The zero in $\{-1, 0, +1\}$ isn’t just another value. It’s a feature.
Small weights in neural networks are often noise. In binary quantization ($\{-1, +1\}$), every weight gets forced to plus or minus one. A weight that’s $0.02$ — basically noise — gets amplified to a full $\pm 1$. That destroys reasoning.
With trits, that weight can become zero. Say our scales are $\alpha^{(1)} = 0.4$ and $\alpha^{(2)} = 0.1$. What’s the best trit combination for a weight of $0.02$?
(T⁽¹⁾, T⁽²⁾) = (0, 0) → 0.4×0 + 0.1×0 = 0.00 error = 0.02² = 0.0004
(T⁽¹⁾, T⁽²⁾) = (0, 1) → 0.4×0 + 0.1×1 = 0.10 error = 0.08² = 0.0064
(T⁽¹⁾, T⁽²⁾) = (1,-1) → 0.4×1 + 0.1×-1 = 0.30 error = 0.28² = 0.0784
The algorithm picks $(0, 0)$. That tiny weight gets killed. Free denoising.
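A quick check of that pick with the same nine-way search (illustrative Python, values from the example above):

```python
from itertools import product

w, a1, a2 = 0.02, 0.4, 0.1
best = min(product([-1, 0, 1], repeat=2),
           key=lambda t: (w - (a1 * t[0] + a2 * t[1])) ** 2)
print(best)  # (0, 0): the near-noise weight is zeroed out instead of amplified
```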
The paper shows this matters most for math and coding tasks — exactly the tasks where noise hurts the most. Binary methods like PBLLM and BiLLM score 0% on Math-500. PTQTP scores 82.4%.
in practice
Our example used a tiny 2x4 matrix. Real models have weight matrices with thousands of rows and columns. A few things change at scale.
The weight matrix gets divided into groups of 128 columns. Each group gets its own pair of scale factors. More groups means more scales to store, but 2 FP16 numbers per 128 weights is negligible overhead — and the approximation gets much better.
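A sketch of what that grouping could look like in code, reusing a per-matrix decomposition routine on 128-column slices (the function name and interface are my assumptions, not the paper’s):

```python
def quantize_groupwise(W, decompose, group_size=128):
    """Split W into column groups and decompose each group independently.

    `decompose` is any routine returning (T1, T2, a1, a2) for a matrix,
    e.g. the ptqtp_decompose sketch from earlier. Each group gets its own
    per-row scale pair, which is what tightens the approximation.
    """
    groups = []
    for start in range(0, W.shape[1], group_size):
        block = W[:, start:start + group_size]
        groups.append(decompose(block))
    return groups
```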
The algorithm converges within about 30 iterations. The paper caps it at 50 with a small convergence tolerance, but in practice the scales stop changing well before that.
The whole quantization process — all layers, all weights — runs in about an hour on a single GPU. Compare that to BitNet 1.58b which needs 10-14 GPU-days of full retraining. No retraining, no fine-tuning, no architecture changes. Just decompose and go.
the results
Tested on LLaMA 3.x and Qwen3, from 0.6B to 70B parameters:
- Matches or rivals BitNet 1.58b quality (which requires full retraining)
- 4.63x faster inference than FP16 on an RTX 3090
- ~1 hour of quantization vs 10-14 GPU-days for QAT
- No architecture changes needed — drop-in replacement
- 95% average accuracy retention on language reasoning tasks
- Math and coding capabilities preserved where binary methods completely collapse
The storage cost is about 4 bits per weight (2 trit-planes x 2 bits each, plus scale overhead). More than BitNet’s 1.58 bits, but PTQTP doesn’t require retraining — and that’s the whole point. You take any pretrained model, run the algorithm for an hour, and get multiplication-free inference.