In this post, I’ll break down PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models, a quantization method that decomposes weight matrices into ternary trit-planes to enable multiplication-free inference — rivaling quantization-aware training quality at a fraction of the cost.
the problem
Neural networks store knowledge in weights. Floating point numbers. A 7B parameter model at FP16 precision is 7 billion weights times 16 bits each. That’s 14 GB just sitting in memory doing nothing until you run inference.
Inference means matrix multiplications. Lots of them. Every single weight gets multiplied by an activation value. Multiply, accumulate, repeat billions of times. That’s what makes LLMs slow and expensive.
Quantization is the idea of using fewer bits per weight. Go from 16-bit floats to 4-bit integers and your model is 4x smaller. But push below 2 bits and things start to break. The weights lose too much information and the model falls apart.
PTQTP pushes to the extreme: each weight becomes a combination of just three values. Negative one, zero, or one. And it works.
trits, not bits
A bit has two states: 0 or 1. With 1 bit you can represent 2 things. With 2 bits you can represent 4 things (00, 01, 10, 11). In general, the number of things you can represent with n bits grows exponentially: 2^n.
So how many bits do you need to represent 3 things? You’re solving 2^b = 3. Take the log of both sides: b = log₂(3) ≈ 1.585. Not a whole number. You can’t actually use 1.585 physical bits — in hardware you’d round up to 2. But information-theoretically, a value that can be one of three things carries about 1.58 bits of information. That’s where the “1.58-bit” precision comes from.
Those three things, in PTQTP, are -1, 0, and +1. That’s a trit.
During inference, every weight gets multiplied by an input value. When weights are normal floats, that’s a real multiplication. But when the weight can only be one of -1, 0, or +1, watch what happens:
1 × 3.7 = 3.7 (just use 3.7 as-is, no math needed)
-1 × 3.7 = -3.7 (just flip the sign of 3.7)
0 × 3.7 = 0.0 (just skip it entirely)
None of these are actual multiplications. The first is a copy. The second is a sign flip — toggling one bit in the floating point representation. The third is nothing at all. The multiply instruction never fires.
Why does this matter? Because multiplying two floating point numbers is expensive. The chip has to decompose both numbers, multiply the mantissas, add the exponents, normalize the result, handle edge cases. A floating point multiplier is a big circuit — it takes many clock cycles, burns energy, and eats die area.
An addition is way cheaper. The circuit is smaller, it finishes faster, it uses less power. A sign flip is even cheaper — literally toggling one bit. And skipping a zero costs nothing at all.
Now think about scale. A single layer in an LLM with a 4096 × 4096 weight matrix does ~16 million multiplications. Per layer. Per token. Every one of them goes through that expensive multiplier circuit. Replace all of them with additions and sign flips and the chip spends almost no time on math. The bottleneck shifts to just loading the weights from memory fast enough.
the decomposition
Here’s the core idea. Take a weight matrix W and approximate it as:

W ≈ α⁽¹⁾ ⊙ T⁽¹⁾ + α⁽²⁾ ⊙ T⁽²⁾

For a weight matrix of size m × n:
- T⁽¹⁾ and T⁽²⁾ are m × n matrices filled with only -1, 0, or +1 — the “trit-planes”
- α⁽¹⁾ and α⁽²⁾ are vectors of length m — one scale factor per row
The ⊙ is row-wise scaling, not matrix multiplication. Each α is a column vector and each T is a matrix. αᵢ multiplies every element in row i of T. Each row of the trit-plane gets scaled by its own scalar.
Two trit-planes. The first one captures the rough shape of the weights. The second one fixes the details. Think of it like a coarse pass and a fine pass.
Here’s what that looks like with numbers. Say we have a weight matrix:

W = [ 0.5  -0.5   0.3  -0.3 ]
    [ 0.8   0.2  -0.8  -0.2 ]

Its decomposition is:

α⁽¹⁾ = (0.4, 0.5)    T⁽¹⁾ = [ 1  -1   1  -1 ]
                            [ 1   1  -1  -1 ]

α⁽²⁾ = (0.1, 0.3)    T⁽²⁾ = [ 1  -1  -1   1 ]
                            [ 1  -1  -1   1 ]

Row-wise scaling means α⁽¹⁾₁ = 0.4 multiplies every element in row 1 of T⁽¹⁾, and α⁽¹⁾₂ = 0.5 multiplies every element in row 2. Written out element by element for row 1:

Ŵ₁₁ = 0.4 × ( 1) + 0.1 × ( 1) =  0.5
Ŵ₁₂ = 0.4 × (-1) + 0.1 × (-1) = -0.5
Ŵ₁₃ = 0.4 × ( 1) + 0.1 × (-1) =  0.3
Ŵ₁₄ = 0.4 × (-1) + 0.1 × ( 1) = -0.3
Two trits and two scalars per weight. That’s it.
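The decomposition is easy to sanity-check in code. Here is a minimal NumPy sketch of the worked example above (just the reconstruction, not the quantization itself):

```python
import numpy as np

# Matrices from the worked example in this post
W = np.array([[0.5, -0.5,  0.3, -0.3],
              [0.8,  0.2, -0.8, -0.2]])
T1 = np.array([[ 1, -1,  1, -1],
               [ 1,  1, -1, -1]])
T2 = np.array([[ 1, -1, -1,  1],
               [ 1, -1, -1,  1]])
a1 = np.array([0.4, 0.5])   # one scale per row, first plane
a2 = np.array([0.1, 0.3])   # one scale per row, second plane

# Row-wise scaling: a[:, None] broadcasts each row's scale across its columns
W_hat = a1[:, None] * T1 + a2[:, None] * T2
print(np.allclose(W_hat, W))  # True: the toy decomposition is exact
```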
the algorithm
Take this small weight matrix:

W = [ 0.5  -0.5   0.3  -0.3 ]
    [ 0.8   0.2  -0.8  -0.2 ]

We need to find T⁽¹⁾, T⁽²⁾, α⁽¹⁾, and α⁽²⁾ such that W ≈ α⁽¹⁾ ⊙ T⁽¹⁾ + α⁽²⁾ ⊙ T⁽²⁾. Four unknowns. We can’t solve for all of them at once. So we break it into two smaller problems.
Problem A: If the trits are fixed, finding the best scales is a regression problem. For each row i, we want α⁽¹⁾ᵢ and α⁽²⁾ᵢ that minimize the squared error:

Σⱼ ( Wᵢⱼ − α⁽¹⁾ᵢ T⁽¹⁾ᵢⱼ − α⁽²⁾ᵢ T⁽²⁾ᵢⱼ )²
This is exactly the same as ordinary least squares regression. In classic linear regression you have data points y, features X, and you want coefficients β that minimize ‖y − Xβ‖². The solution is β = (XᵀX)⁻¹Xᵀy.
Here the mapping is:
- y → the weights in row i of W (what we’re trying to approximate)
- X → the trit values from row i of T⁽¹⁾ and T⁽²⁾, stacked as columns
- β → the scales (α⁽¹⁾ᵢ, α⁽²⁾ᵢ), the coefficients we’re solving for
So the solution is β = (XᵀX + εI)⁻¹Xᵀy. The εI adds a tiny number to the diagonal of XᵀX before inverting. Why? Because inverting requires dividing by the determinant of XᵀX. If the trit vectors happen to be nearly identical, the determinant gets close to zero and the division blows up. Adding ε to the diagonal nudges it away from zero.
Since we only have 2 coefficients, XᵀX is always 2 × 2. And a 2 × 2 matrix [[a, b], [c, d]] has a simple closed-form inverse: 1/(ad − bc) × [[d, −b], [−c, a]]. Swap the diagonals, negate the off-diagonals, divide by the determinant. No iterative solver needed.
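Problem A fits in a few lines of NumPy using that closed-form inverse. This is a sketch for illustration; the function name `solve_scales` and the ε value of 1e-6 are my choices, not names or constants from the paper:

```python
import numpy as np

def solve_scales(w_row, t1_row, t2_row, eps=1e-6):
    """Problem A: trits fixed, solve for the two scales in closed form."""
    X = np.stack([t1_row, t2_row], axis=1).astype(float)  # shape (n, 2)
    A = X.T @ X + eps * np.eye(2)     # XᵀX with ε on the diagonal
    b = X.T @ w_row                   # Xᵀy
    # 2x2 inverse: swap the diagonal, negate the off-diagonal, divide by det
    det = A[0, 0] * A[1, 1] - A[0, 1] * A[1, 0]
    inv = np.array([[ A[1, 1], -A[0, 1]],
                    [-A[1, 0],  A[0, 0]]]) / det
    return inv @ b                    # (α¹, α²) for this row

# Row 1 of the running example
a = solve_scales(np.array([0.5, -0.5, 0.3, -0.3]),
                 np.array([1, -1,  1, -1]),
                 np.array([1, -1, -1,  1]))
print(a)  # ≈ [0.4, 0.1]
```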
To see this concretely, suppose we have these trit-planes:

T⁽¹⁾ = [ 1  -1   1  -1 ]      T⁽²⁾ = [ 1  -1  -1   1 ]
       [ 1   1  -1  -1 ]             [ 1  -1  -1   1 ]
Each row gives 4 equations with 2 unknowns. For row 1:

 0.5 = α⁽¹⁾( 1) + α⁽²⁾( 1)
-0.5 = α⁽¹⁾(-1) + α⁽²⁾(-1)
 0.3 = α⁽¹⁾( 1) + α⁽²⁾(-1)
-0.3 = α⁽¹⁾(-1) + α⁽²⁾( 1)
More equations than unknowns means there’s no exact solution in general, but there’s a least-squares best one. In matrix form, this is y = Xβ, where X stacks row 1 of each trit-plane as columns:

y = [ 0.5 ]    X = [  1   1 ]    β = [ α⁽¹⁾ ]
    [-0.5 ]        [ -1  -1 ]        [ α⁽²⁾ ]
    [ 0.3 ]        [  1  -1 ]
    [-0.3 ]        [ -1   1 ]
Plugging into the formula: the two trit columns here happen to be orthogonal, so

XᵀX = [ 4  0 ]    ⇒    (XᵀX + εI)⁻¹ ≈ [ 0.25  0    ]
      [ 0  4 ]                        [ 0     0.25 ]
Xᵀy multiplies each row of Xᵀ by the weight vector:

Xᵀy = [ 0.5 + 0.5 + 0.3 + 0.3 ]  =  [ 1.6 ]
      [ 0.5 + 0.5 - 0.3 - 0.3 ]     [ 0.4 ]

β = (XᵀX + εI)⁻¹ Xᵀy ≈ [ 0.25 × 1.6 ]  =  [ 0.4 ]
                       [ 0.25 × 0.4 ]     [ 0.1 ]
So α⁽¹⁾₁ = 0.4, α⁽²⁾₁ = 0.1. Here the determinant is far from zero (4 × 4 − 0 = 16), so ε barely changes anything. But if the two trit vectors were nearly identical, the determinant would approach zero and the division would blow up — ε prevents that.
Problem B: If the scales are fixed, finding the best trits is even simpler. Each trit can only be -1, 0, or +1. Two trits per entry means 3 × 3 = 9 possible combinations. Just try all nine, compute the error for each one, pick the smallest.
For example, say we know α⁽¹⁾ = 0.4 and α⁽²⁾ = 0.1. What trits best reconstruct W₁₁ = 0.5?
(T⁽¹⁾, T⁽²⁾) = (-1,-1) → 0.4×(-1) + 0.1×(-1) = -0.50 error = 1.000
(T⁽¹⁾, T⁽²⁾) = (-1, 0) → 0.4×(-1) + 0.1×( 0) = -0.40 error = 0.810
(T⁽¹⁾, T⁽²⁾) = (-1, 1) → 0.4×(-1) + 0.1×( 1) = -0.30 error = 0.640
(T⁽¹⁾, T⁽²⁾) = ( 0,-1) → 0.4×( 0) + 0.1×(-1) = -0.10 error = 0.360
(T⁽¹⁾, T⁽²⁾) = ( 0, 0) → 0.4×( 0) + 0.1×( 0) = 0.00 error = 0.250
(T⁽¹⁾, T⁽²⁾) = ( 0, 1) → 0.4×( 0) + 0.1×( 1) = 0.10 error = 0.160
(T⁽¹⁾, T⁽²⁾) = ( 1,-1) → 0.4×( 1) + 0.1×(-1) = 0.30 error = 0.040
(T⁽¹⁾, T⁽²⁾) = ( 1, 0) → 0.4×( 1) + 0.1×( 0) = 0.40 error = 0.010
(T⁽¹⁾, T⁽²⁾) = ( 1, 1) → 0.4×( 1) + 0.1×( 1) = 0.50 error = 0.000 ← winner
(T⁽¹⁾, T⁽²⁾) = (1, 1) gives zero error. Now for W₁₃ = 0.3:
(T⁽¹⁾, T⁽²⁾) = (-1,-1) → -0.50 error = 0.640
(T⁽¹⁾, T⁽²⁾) = (-1, 0) → -0.40 error = 0.490
(T⁽¹⁾, T⁽²⁾) = (-1, 1) → -0.30 error = 0.360
(T⁽¹⁾, T⁽²⁾) = ( 0,-1) → -0.10 error = 0.160
(T⁽¹⁾, T⁽²⁾) = ( 0, 0) → 0.00 error = 0.090
(T⁽¹⁾, T⁽²⁾) = ( 0, 1) → 0.10 error = 0.040
(T⁽¹⁾, T⁽²⁾) = ( 1,-1) → 0.30 error = 0.000 ← winner
(T⁽¹⁾, T⁽²⁾) = ( 1, 0) → 0.40 error = 0.010
(T⁽¹⁾, T⁽²⁾) = ( 1, 1) → 0.50 error = 0.040
The winner is (1, -1). The first plane adds 0.4, the second subtracts 0.1, giving exactly 0.3. This is why two planes matter — the same scales can reconstruct both 0.5 and 0.3 by choosing different trit combinations. Repeat for every entry in the matrix, and you have the full trit-planes. Brute force. No cleverness needed.
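Problem B is short enough to write out in full. A sketch (the function name `best_trits` is mine, not the paper’s):

```python
import itertools

def best_trits(w, a1, a2):
    """Problem B: scales fixed, try all 9 trit pairs and keep the one
    that minimizes the squared reconstruction error."""
    candidates = itertools.product((-1, 0, 1), repeat=2)
    return min(candidates, key=lambda t: (w - a1 * t[0] - a2 * t[1]) ** 2)

print(best_trits(0.5, 0.4, 0.1))   # (1, 1)
print(best_trits(0.3, 0.4, 0.1))   # (1, -1)
```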
Neither problem is useful alone. A needs trits to compute the scales. B needs scales to evaluate which trit combination is best — look at the table above, every error computation uses α⁽¹⁾ = 0.4 and α⁽²⁾ = 0.1. Without those numbers, there’s nothing to brute force over.
But if we start with a rough guess and keep alternating — solve A, then B, then A, then B — each step reduces the error and the whole thing converges.
step 1: initialize
Start simple. Set both trit-planes to the sign of each weight. Positive weight becomes +1, negative becomes -1:

T⁽¹⁾ = T⁽²⁾ = sign(W) = [ 1  -1   1  -1 ]
                        [ 1   1  -1  -1 ]
Both planes are identical. Not great. But it’s a starting point.
step 2: find the best scales
With the trits fixed, apply Problem A to each row — solve the least-squares regression to get α⁽¹⁾ᵢ and α⁽²⁾ᵢ.
step 3: find the best trits
With the scales fixed, apply Problem B to every entry — try all 9 trit combinations, pick the winner. The two planes start to diverge.
convergence
We keep alternating steps 2 and 3. Each round, the scales get better because the trits improved, and the trits get better because the scales improved. The error drops monotonically.
Our tiny example converges in one round because the matrix was small and the decomposition happened to be exact. In practice, with thousands of weights, the reconstruction error won’t reach zero — the algorithm iterates until the error drops below a tolerance, typically converging in around 30 iterations, capped at 50.
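Putting the three steps together, here is a toy version of the alternating loop. It is a sketch under assumptions, not the paper’s reference implementation: I seed the second plane from the residual of the first (a small tweak to the plain sign init that helps this tiny example converge), use per-row scales instead of column groups, and the names are my own:

```python
import itertools
import numpy as np

def ptqtp_toy(W, max_iter=50, tol=1e-6):
    """Toy alternating decomposition W ≈ α¹⊙T¹ + α²⊙T² with per-row scales."""
    T1 = np.sign(W); T1[T1 == 0] = 1          # step 1: coarse plane = sign(W)
    r = W - np.abs(W).mean(axis=1, keepdims=True) * T1
    T2 = np.sign(r); T2[T2 == 0] = 1          # fine plane from the residual
    m, n = W.shape
    a1, a2 = np.zeros(m), np.zeros(m)
    pairs = list(itertools.product((-1, 0, 1), repeat=2))
    for _ in range(max_iter):
        for i in range(m):
            # step 2 (Problem A): regularized least squares for the scales
            X = np.stack([T1[i], T2[i]], axis=1)
            a1[i], a2[i] = np.linalg.solve(X.T @ X + 1e-6 * np.eye(2),
                                           X.T @ W[i])
            # step 3 (Problem B): brute-force the 9 trit pairs per entry
            for j in range(n):
                T1[i, j], T2[i, j] = min(
                    pairs,
                    key=lambda t: (W[i, j] - a1[i]*t[0] - a2[i]*t[1])**2)
        if np.abs(W - a1[:, None]*T1 - a2[:, None]*T2).max() < tol:
            break                             # converged
    return T1, T2, a1, a2

W = np.array([[0.5, -0.5,  0.3, -0.3],
              [0.8,  0.2, -0.8, -0.2]])
T1, T2, a1, a2 = ptqtp_toy(W)
print(a1, a2)  # ≈ [0.4, 0.5] and [0.1, 0.3], exact reconstruction
```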
The final decomposition (including row 2, which follows the same process):

α⁽¹⁾ = (0.4, 0.5)    T⁽¹⁾ = [ 1  -1   1  -1 ]
                            [ 1   1  -1  -1 ]

α⁽²⁾ = (0.1, 0.3)    T⁽²⁾ = [ 1  -1  -1   1 ]
                            [ 1  -1  -1   1 ]
These are exactly the trit-planes and scales we used in the Problem A and B examples above — the algorithm converges to them.
Let’s verify by reconstructing every entry.
Row 1: α⁽¹⁾₁ = 0.4, α⁽²⁾₁ = 0.1
Ŵ₁₁ = 0.4 × ( 1) + 0.1 × ( 1) = 0.4 + 0.1 = 0.5 ✓
Ŵ₁₂ = 0.4 × (-1) + 0.1 × (-1) = -0.4 - 0.1 = -0.5 ✓
Ŵ₁₃ = 0.4 × ( 1) + 0.1 × (-1) = 0.4 - 0.1 = 0.3 ✓
Ŵ₁₄ = 0.4 × (-1) + 0.1 × ( 1) = -0.4 + 0.1 = -0.3 ✓
Row 2: α⁽¹⁾₂ = 0.5, α⁽²⁾₂ = 0.3
Ŵ₂₁ = 0.5 × ( 1) + 0.3 × ( 1) = 0.5 + 0.3 = 0.8 ✓
Ŵ₂₂ = 0.5 × ( 1) + 0.3 × (-1) = 0.5 - 0.3 = 0.2 ✓
Ŵ₂₃ = 0.5 × (-1) + 0.3 × (-1) = -0.5 - 0.3 = -0.8 ✓
Ŵ₂₄ = 0.5 × (-1) + 0.3 × ( 1) = -0.5 + 0.3 = -0.2 ✓
Perfect reconstruction. Every entry matches.
why not just one plane?
You might wonder: why two trit-planes? Why not just one?
Let’s try. With a single trit-plane, the best we can do for row 1 is:

α = 0.4,   T = ( 1  -1   1  -1 )
Reconstruction:
Ŵ₁₁ = 0.4 × ( 1) = 0.4 (want 0.5, error = 0.1)
Ŵ₁₂ = 0.4 × (-1) = -0.4 (want -0.5, error = 0.1)
Ŵ₁₃ = 0.4 × ( 1) = 0.4 (want 0.3, error = 0.1)
Ŵ₁₄ = 0.4 × (-1) = -0.4 (want -0.3, error = 0.1)
One scale factor can’t capture two different magnitudes (0.5 and 0.3). The second trit-plane fixes this. It adds 0.1 where the weight is bigger and subtracts 0.1 where it’s smaller. Coarse + fine = exact.
multiply without multiplying
Here’s where it gets good. Let’s do an actual forward pass.
Input vector:

x = (1.0, 2.0, -1.0, 0.5)
The normal way (FP16)
Multiply every weight by the corresponding input, then sum:
y₁ = 0.5 × 1.0 + (-0.5) × 2.0 + 0.3 × (-1.0) + (-0.3) × 0.5
= 0.5 - 1.0 - 0.3 - 0.15
= -0.95
That’s 4 multiplications and 3 additions.
The PTQTP way
First, compute T⁽¹⁾ · x for row 1. Each trit just tells us whether to add, subtract, or skip:
T₁₁ = 1 → keep x₁ → +1.0
T₁₂ = -1 → negate x₂ → -2.0
T₁₃ = 1 → keep x₃ → -1.0
T₁₄ = -1 → negate x₄ → -0.5
sum = -2.5
Then T⁽²⁾ · x for row 1:
T₁₁ = 1 → keep x₁ → +1.0
T₁₂ = -1 → negate x₂ → -2.0
T₁₃ = -1 → negate x₃ → +1.0
T₁₄ = 1 → keep x₄ → +0.5
sum = 0.5
No multiplications so far — just additions and sign flips. Now we need to multiply by the scale factors:

y₁ = α⁽¹⁾₁ × (T⁽¹⁾ · x) + α⁽²⁾₁ × (T⁽²⁾ · x) = 0.4 × (-2.5) + 0.1 × 0.5 = -1.0 + 0.05 = -0.95
Same answer. But count the operations: the trit-vector dot products used zero multiplications — just additions and subtractions. The only real multiplications are the two scale factors at the end. Normal inference does n multiplications per row — one per weight. PTQTP does 2 multiplications per row — one per trit-plane. The rest are additions.
Let’s do row 2 to be thorough:
T⁽¹⁾ · x: (+1.0) + (+2.0) + (+1.0) + (-0.5) = 3.5
T⁽²⁾ · x: (+1.0) + (-2.0) + (+1.0) + (+0.5) = 0.5
Scale and sum: y₂ = 0.5 × 3.5 + 0.3 × 0.5 = 1.75 + 0.15 = 1.9. Verify with FP16:

y₂ = 0.8 × 1.0 + 0.2 × 2.0 + (-0.8) × (-1.0) + (-0.2) × 0.5
   = 0.8 + 0.4 + 0.8 - 0.1
   = 1.9 ✓
For a row with 4096 entries, normal inference does 4096 multiplications. PTQTP does 2 multiplications (the scales) plus roughly 2 × 4096 additions, one set per trit-plane. That’s why it’s 4.63x faster.
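The whole trick fits in a dozen lines. A sketch of the trit dot product using the numbers from this section, written so that no multiply ever touches a weight (only the final scale step multiplies):

```python
def trit_dot(trits, x):
    """Dot product against a trit vector: copy, sign-flip, or skip.
    No multiplication runs inside this loop."""
    acc = 0.0
    for t, v in zip(trits, x):
        if t == 1:
            acc += v      # copy
        elif t == -1:
            acc -= v      # sign flip
        # t == 0: skip entirely
    return acc

x = [1.0, 2.0, -1.0, 0.5]
s1 = trit_dot([1, -1,  1, -1], x)   # -2.5
s2 = trit_dot([1, -1, -1,  1], x)   #  0.5
y1 = 0.4 * s1 + 0.1 * s2            # the only two real multiplications
print(y1)  # ≈ -0.95, same as the FP16 dot product
```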
the zero trit: free denoising
The zero in {-1, 0, +1} isn’t just another value. It’s a feature.
Small weights in neural networks are often noise. In binary quantization ({-1, +1}), every weight gets forced to plus or minus one. A weight that’s 0.02 — basically noise — gets pushed all the way out to ±α (±0.4 with the scales below). That destroys reasoning.
With trits, that weight can become zero. Say our scales are α⁽¹⁾ = 0.4 and α⁽²⁾ = 0.1. What’s the best trit combination for a weight of 0.02?
(T⁽¹⁾, T⁽²⁾) = (0, 0) → 0.4×0 + 0.1×0 = 0.00 error = 0.02² = 0.0004
(T⁽¹⁾, T⁽²⁾) = (0, 1) → 0.4×0 + 0.1×1 = 0.10 error = 0.08² = 0.0064
(T⁽¹⁾, T⁽²⁾) = (1,-1) → 0.4×1 + 0.1×-1 = 0.30 error = 0.28² = 0.0784
The algorithm picks (0, 0). That tiny weight gets killed. Free denoising.
The paper shows this matters most for math and coding tasks — exactly the tasks where noise hurts the most. Binary methods like PBLLM and BiLLM score 0% on Math-500. PTQTP scores 82.4%.
in practice
Our example used a tiny 2x4 matrix. Real models have weight matrices with thousands of rows and columns. A few things change at scale.
In our example, each row had one α⁽¹⁾ and one α⁽²⁾ — two scales shared across all 4 columns. That works fine for 4 weights, but a real row might have 4096 entries. Two scales can’t capture that much variation.
The paper groups weights: divide each row into blocks of 128 columns, and give each block its own pair of scale factors. A row with 4096 entries gets 4096 / 128 = 32 groups, each with 2 FP16 scales. That’s 64 extra numbers per row — negligible overhead compared to 4096 trit values — and the approximation gets much better because each scale pair only needs to fit 128 weights instead of thousands.
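The bookkeeping is worth a quick sanity check. Assuming the sizes in the text (4096-wide rows, 128-column groups, FP16 scales, 2 bits per stored trit); this is illustrative arithmetic, not values pulled from the paper’s code:

```python
# Per-row storage arithmetic for grouped scales (sizes assumed from the text)
n, group_size = 4096, 128
num_groups = n // group_size          # 32 groups per row
scales_per_row = 2 * num_groups       # 2 planes -> 64 FP16 scales per row
scale_bits = scales_per_row * 16      # 1024 bits of scale overhead per row
trit_bits = 2 * n * 2                 # 2 planes x 4096 trits x 2 bits each
print(num_groups, scales_per_row, scale_bits / trit_bits)  # overhead ≈ 6%
```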
The algorithm converges within about 30 iterations. The paper caps it at 50 with a small convergence tolerance, but in practice the scales stop changing well before that.
The whole quantization process — all layers, all weights — runs in about an hour on a single GPU. Compare that to BitNet 1.58b which needs 10-14 GPU-days of full retraining. No retraining, no fine-tuning, no architecture changes. Just decompose and go.
the results
Tested on LLaMA 3.x and Qwen3, from 0.6B to 70B parameters:
- Matches or rivals BitNet 1.58b quality (which requires full retraining)
- 4.63x faster inference than FP16 on an RTX 3090
- ~1 hour of quantization vs 10-14 GPU-days for QAT
- No architecture changes needed — drop-in replacement
- 95% average accuracy retention on language reasoning tasks
- Math and coding capabilities preserved where binary methods completely collapse
The storage cost is about 4 bits per weight (2 trit-planes x 2 bits each, plus scale overhead). More than BitNet’s 1.58 bits, but PTQTP doesn’t require retraining — and that’s the whole point. You take any pretrained model, run the algorithm for an hour, and get multiplication-free inference.