glassduck

Feb 8, 2026 ~ 19 min read

Post-Training Quantization to Trit-Planes for Large Language Models


In this post, I’ll break down PTQTP: Post-Training Quantization to Trit-Planes for Large Language Models, a quantization method that decomposes weight matrices into ternary trit-planes to enable multiplication-free inference — rivaling quantization-aware training quality at a fraction of the cost.

the problem

Neural networks store knowledge in weights. Floating point numbers. A 7B parameter model at FP16 precision is 7 billion weights times 16 bits each. That’s 14 GB just sitting in memory doing nothing until you run inference.

Inference means matrix multiplications. Lots of them. Every single weight gets multiplied by an activation value. Multiply, accumulate, repeat billions of times. That’s what makes LLMs slow and expensive.

Quantization is the idea of using fewer bits per weight. Go from 16-bit floats to 4-bit integers and your model is 4x smaller. But push below 2 bits and things start to break. The weights lose too much information and the model falls apart.

PTQTP pushes to the extreme: each weight becomes a combination of just three values. Negative one, zero, or one. And it works.

trits, not bits

A bit has two states: 0 or 1. With 1 bit you can represent 2 things. With 2 bits you can represent 4 things (00, 01, 10, 11). In general, the number of things you can represent with bits grows exponentially: $n \text{ bits} \rightarrow 2^n \text{ things}$.

So how many bits do you need to represent 3 things? You’re solving $2^n = 3$. Take the log of both sides: $n = \log_2(3) \approx 1.585$. Not a whole number. You can’t actually use 1.585 physical bits — in hardware you’d round up to 2. But information-theoretically, a value that can be one of three things carries 1.58 bits of information. That’s where the “1.58-bit” precision comes from.

Those three things, in PTQTP, are $\{-1, 0, 1\}$. That’s a trit.

During inference, every weight gets multiplied by an input value. When weights are normal floats, that’s a real multiplication. But when the weight can only be one of $\{-1, 0, 1\}$, watch what happens:

 1 × 3.7 =  3.7    (just use 3.7 as-is, no math needed)
-1 × 3.7 = -3.7    (just flip the sign of 3.7)
 0 × 3.7 =  0.0    (just skip it entirely)

None of these are actual multiplications. The first is a copy. The second is a sign flip — toggling one bit in the floating point representation. The third is nothing at all. The multiply instruction never fires.

Why does this matter? Because multiplying two floating point numbers is expensive. The chip has to decompose both numbers, multiply the mantissas, add the exponents, normalize the result, handle edge cases. A floating point multiplier is a big circuit — it takes many clock cycles, burns energy, and eats die area.

An addition is way cheaper. The circuit is smaller, it finishes faster, it uses less power. A sign flip is even cheaper — literally toggling one bit. And skipping a zero costs nothing at all.

Now think about scale. A single layer in an LLM with $4096 \times 4096$ weights does ~16 million multiplications. Per layer. Per token. Every one of them goes through that expensive multiplier circuit. Replace all of them with additions and sign flips and the chip spends almost no time on math. The bottleneck shifts to just loading the weights from memory fast enough.

the decomposition

Here’s the core idea. Take a weight matrix $W$ and approximate it as:

$$W \approx \alpha^{(1)} \odot T^{(1)} + \alpha^{(2)} \odot T^{(2)}$$

For a weight matrix $W$ of size $m \times n$:

  • $T^{(1)}$ and $T^{(2)}$ are $m \times n$ matrices filled with only $-1$, $0$, or $1$ — the “trit-planes”
  • $\alpha^{(1)}$ and $\alpha^{(2)}$ are vectors of length $m$ — one scale factor per row

The $\odot$ is row-wise scaling, not matrix multiplication. Each $\alpha$ is a column vector and each $T$ is a matrix. $\alpha_i$ multiplies every element in row $i$ of $T$. Each row of the trit-plane gets scaled by its own scalar.

Two trit-planes. The first one captures the rough shape of the weights. The second one fixes the details. Think of it like a coarse pass and a fine pass.

Here’s what that looks like with numbers. Say we have a $2 \times 4$ weight matrix:

$$W = \begin{bmatrix} 0.5 & -0.5 & 0.3 & -0.3 \\ 0.8 & 0.2 & -0.8 & -0.2 \end{bmatrix}$$

Its decomposition is:

$$\alpha^{(1)} = \begin{bmatrix} 0.4 \\ 0.5 \end{bmatrix}, \quad T^{(1)} = \begin{bmatrix} 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \end{bmatrix}$$

$$\alpha^{(2)} = \begin{bmatrix} 0.1 \\ 0.3 \end{bmatrix}, \quad T^{(2)} = \begin{bmatrix} 1 & -1 & -1 & 1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$$

Row-wise scaling means $\alpha^{(1)}_1 = 0.4$ multiplies every element in row 1 of $T^{(1)}$, and $\alpha^{(1)}_2 = 0.5$ multiplies every element in row 2. Written out element by element:

$$\hat{W}_{ij} = \alpha_i^{(1)} \cdot T_{ij}^{(1)} + \alpha_i^{(2)} \cdot T_{ij}^{(2)}$$

Two trits per weight, plus two scalars shared across each row. That’s it.
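
To make the row-wise scaling concrete, here’s a minimal numpy sketch (variable names are mine, not from the paper) that rebuilds the approximation from the two planes and the scale vectors:

```python
import numpy as np

# The 2x4 example: two trit-planes plus one scale per row per plane.
T1 = np.array([[1, -1,  1, -1],
               [1,  1, -1, -1]])
T2 = np.array([[1, -1, -1,  1],
               [1, -1, -1,  1]])
a1 = np.array([0.4, 0.5])   # alpha^(1): one scale per row
a2 = np.array([0.1, 0.3])   # alpha^(2)

# Row-wise scaling: a1[:, None] broadcasts each row's scale across that row's columns.
W_hat = a1[:, None] * T1 + a2[:, None] * T2
print(W_hat)
# [[ 0.5 -0.5  0.3 -0.3]
#  [ 0.8  0.2 -0.8 -0.2]]
```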

let’s compute the decomposition

Take this small $2 \times 4$ weight matrix:

$$W = \begin{bmatrix} 0.5 & -0.5 & 0.3 & -0.3 \\ 0.8 & 0.2 & -0.8 & -0.2 \end{bmatrix}$$

We need to find $T^{(1)}$, $T^{(2)}$, $\alpha^{(1)}$, and $\alpha^{(2)}$ such that $W \approx \alpha^{(1)} \odot T^{(1)} + \alpha^{(2)} \odot T^{(2)}$. Four unknowns. We can’t solve for all of them at once. So we break it into two smaller problems.

Problem A: If the trits are fixed, finding the best scales is a regression problem. For each row $i$, we want $\alpha_i^{(1)}$ and $\alpha_i^{(2)}$ that minimize the squared error:

$$\min_{\alpha_i^{(1)},\, \alpha_i^{(2)}} \sum_{j} \left( W_{ij} - \alpha_i^{(1)} T_{ij}^{(1)} - \alpha_i^{(2)} T_{ij}^{(2)} \right)^2$$

For our example, that’s:

$$\begin{bmatrix} 0.5 & -0.5 & 0.3 & -0.3 \\ 0.8 & 0.2 & -0.8 & -0.2 \end{bmatrix} = \begin{bmatrix} \alpha_1^{(1)} \\ \alpha_2^{(1)} \end{bmatrix} \odot \begin{bmatrix} 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \end{bmatrix} + \begin{bmatrix} \alpha_1^{(2)} \\ \alpha_2^{(2)} \end{bmatrix} \odot \begin{bmatrix} 1 & -1 & -1 & 1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$$

The trits are known, the weights are known. The only unknowns are $\alpha^{(1)}$ and $\alpha^{(2)}$. Each row gives 4 equations with 2 unknowns, e.g. for row 1:

$$\begin{aligned} 0.5 &= \alpha_1^{(1)} \cdot 1 + \alpha_1^{(2)} \cdot 1 \\ -0.5 &= \alpha_1^{(1)} \cdot (-1) + \alpha_1^{(2)} \cdot (-1) \\ 0.3 &= \alpha_1^{(1)} \cdot 1 + \alpha_1^{(2)} \cdot (-1) \\ -0.3 &= \alpha_1^{(1)} \cdot (-1) + \alpha_1^{(2)} \cdot 1 \end{aligned}$$

More equations than unknowns means there’s no exact solution in general, but there’s a least-squares best one.

Problem B: If the scales are fixed, finding the best trits is even simpler. Each trit can only be $-1$, $0$, or $1$. Two trits per entry means $3 \times 3 = 9$ possible combinations. Just try all nine, compute the error for each one, pick the smallest.

For example, say we know $\alpha_1^{(1)} = 0.4$ and $\alpha_1^{(2)} = 0.1$. What trits best reconstruct $W_{11} = 0.5$?

(T⁽¹⁾, T⁽²⁾) = (-1,-1) → 0.4×(-1) + 0.1×(-1) = -0.50  error = 1.000
(T⁽¹⁾, T⁽²⁾) = (-1, 0) → 0.4×(-1) + 0.1×( 0) = -0.40  error = 0.810
(T⁽¹⁾, T⁽²⁾) = (-1, 1) → 0.4×(-1) + 0.1×( 1) = -0.30  error = 0.640
(T⁽¹⁾, T⁽²⁾) = ( 0,-1) → 0.4×( 0) + 0.1×(-1) = -0.10  error = 0.360
(T⁽¹⁾, T⁽²⁾) = ( 0, 0) → 0.4×( 0) + 0.1×( 0) =  0.00  error = 0.250
(T⁽¹⁾, T⁽²⁾) = ( 0, 1) → 0.4×( 0) + 0.1×( 1) =  0.10  error = 0.160
(T⁽¹⁾, T⁽²⁾) = ( 1,-1) → 0.4×( 1) + 0.1×(-1) =  0.30  error = 0.040
(T⁽¹⁾, T⁽²⁾) = ( 1, 0) → 0.4×( 1) + 0.1×( 0) =  0.40  error = 0.010
(T⁽¹⁾, T⁽²⁾) = ( 1, 1) → 0.4×( 1) + 0.1×( 1) =  0.50  error = 0.000 ← winner

$(1, 1)$ gives zero error. Repeat for every entry in the matrix, and you have the full trit-planes. Brute force. No cleverness needed.

Neither problem is useful alone. A needs trits to compute the scales. B needs scales to evaluate which trit combination is best — look at the table above, every error computation uses $0.4$ and $0.1$. Without those numbers, there’s nothing to brute force over.

But if we start with a rough guess and keep alternating — solve A, then B, then A, then B — each step reduces the error and the whole thing converges.

step 1: initialize

Start simple. Set both trit-planes to the sign of each weight. Positive weight becomes $1$, negative becomes $-1$.

$$T^{(1)} = T^{(2)} = \text{sign}(W) = \begin{bmatrix} 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \end{bmatrix}$$

Both planes are identical. Not great. But it’s a starting point.

step 2: find the best scales (given the trits)

Now we have trits and need to find the scales that minimize reconstruction error — this is Problem A. For each row $i$, we solve:

$$\min_{\alpha_i^{(1)},\, \alpha_i^{(2)}} \sum_{j} \left( W_{ij} - \alpha_i^{(1)} T_{ij}^{(1)} - \alpha_i^{(2)} T_{ij}^{(2)} \right)^2$$

This is exactly the same as ordinary least squares regression. In classic linear regression you have data points $y$, features $X$, and you want coefficients $\beta$ that minimize $\|y - X\beta\|^2$. The solution is $\beta = (X^\top X)^{-1} X^\top y$.

Here the mapping is:

  • $y$ → the weights $W_i$ (what we’re trying to approximate)
  • $X$ → the trit values from row $i$ of $T^{(1)}$ and $T^{(2)}$, stacked as columns
  • $\beta$ → the scales $\alpha_i$ (the coefficients we’re solving for)

So the solution is $\alpha_i = (X^\top X + \lambda I)^{-1} X^\top W_i$. The $\lambda I$ adds a tiny number (like $10^{-4}$) to the diagonal of $X^\top X$ before inverting. Why? Because inverting requires dividing by the determinant $ad - bc$. If the trit vectors happen to be nearly identical, $ad - bc$ gets close to zero and the division blows up. Adding $\lambda$ to the diagonal nudges it away from zero. In our example $ad - bc = 16$, so $\lambda$ doesn’t matter — but it’s a safety net.

Since we only have 2 coefficients, $X^\top X$ is always $2 \times 2$. And a $2 \times 2$ matrix has a simple closed-form inverse:

$$\begin{bmatrix} a & b \\ c & d \end{bmatrix}^{-1} = \frac{1}{ad - bc} \begin{bmatrix} d & -b \\ -c & a \end{bmatrix}$$

Swap the diagonals, negate the off-diagonals, divide by the determinant. No iterative solver needed.

For row 1, after the trit-planes have converged, $X$ stacks row 1 of $T^{(1)}$ and $T^{(2)}$ as columns:

$$X = \begin{bmatrix} 1 & 1 \\ -1 & -1 \\ 1 & -1 \\ -1 & 1 \end{bmatrix}, \quad W_1 = \begin{bmatrix} 0.5 \\ -0.5 \\ 0.3 \\ -0.3 \end{bmatrix}$$

Plugging into the formula (ignoring the tiny $\lambda$):

$$X^\top X = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix}, \quad X^\top W_1 = \begin{bmatrix} 1.6 \\ 0.4 \end{bmatrix}$$

$$\alpha_1 = \begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix}^{-1} \begin{bmatrix} 1.6 \\ 0.4 \end{bmatrix} = \frac{1}{16}\begin{bmatrix} 4 & 0 \\ 0 & 4 \end{bmatrix} \begin{bmatrix} 1.6 \\ 0.4 \end{bmatrix} = \begin{bmatrix} 0.4 \\ 0.1 \end{bmatrix}$$

$\alpha_1^{(1)} = 0.4$, $\alpha_1^{(2)} = 0.1$. The scales fall right out.
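
That per-row solve is small enough to write out. A minimal numpy sketch (the helper name is mine; it just applies the ridge formula above):

```python
import numpy as np

def solve_scales_for_row(w_row, t1_row, t2_row, lam=1e-4):
    """Problem A for one row: least-squares scales, with the lambda*I safety net."""
    X = np.stack([t1_row, t2_row], axis=1)     # shape (n, 2): the two trit rows as columns
    A = X.T @ X + lam * np.eye(2)              # regularized 2x2 system
    return np.linalg.solve(A, X.T @ w_row)     # [alpha1, alpha2]

w1 = np.array([0.5, -0.5, 0.3, -0.3])
t1 = np.array([1, -1,  1, -1])
t2 = np.array([1, -1, -1,  1])
print(solve_scales_for_row(w1, t1, t2))        # ≈ [0.4, 0.1]
```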

step 3: find the best trits (given the scales)

Now we have scales and need to find the best trit combination for each weight entry. Each entry has two trits ($T^{(1)}$ and $T^{(2)}$), each can be $-1$, $0$, or $1$. That’s $3 \times 3 = 9$ possible combinations. Just try all nine and pick the one with the smallest error.

Let’s do it for the first entry: $w = 0.5$, with $\alpha^{(1)} = 0.4$, $\alpha^{(2)} = 0.1$.

(T⁽¹⁾, T⁽²⁾) = (-1,-1) → 0.4×(-1) + 0.1×(-1) = -0.50  error = (0.5-(-0.50))² = 1.000
(T⁽¹⁾, T⁽²⁾) = (-1, 0) → 0.4×(-1) + 0.1×( 0) = -0.40  error = (0.5-(-0.40))² = 0.810
(T⁽¹⁾, T⁽²⁾) = (-1, 1) → 0.4×(-1) + 0.1×( 1) = -0.30  error = (0.5-(-0.30))² = 0.640
(T⁽¹⁾, T⁽²⁾) = ( 0,-1) → 0.4×( 0) + 0.1×(-1) = -0.10  error = (0.5-(-0.10))² = 0.360
(T⁽¹⁾, T⁽²⁾) = ( 0, 0) → 0.4×( 0) + 0.1×( 0) =  0.00  error = (0.5-( 0.00))² = 0.250
(T⁽¹⁾, T⁽²⁾) = ( 0, 1) → 0.4×( 0) + 0.1×( 1) =  0.10  error = (0.5-( 0.10))² = 0.160
(T⁽¹⁾, T⁽²⁾) = ( 1,-1) → 0.4×( 1) + 0.1×(-1) =  0.30  error = (0.5-( 0.30))² = 0.040
(T⁽¹⁾, T⁽²⁾) = ( 1, 0) → 0.4×( 1) + 0.1×( 0) =  0.40  error = (0.5-( 0.40))² = 0.010
(T⁽¹⁾, T⁽²⁾) = ( 1, 1) → 0.4×( 1) + 0.1×( 1) =  0.50  error = (0.5-( 0.50))² = 0.000 ← winner

The winner is $(1, 1)$ with zero error. The reconstruction hits $0.5$ exactly.

Now for the third entry: $w = 0.3$.

(T⁽¹⁾, T⁽²⁾) = (-1,-1) → -0.50  error = 0.640
(T⁽¹⁾, T⁽²⁾) = (-1, 0) → -0.40  error = 0.490
(T⁽¹⁾, T⁽²⁾) = (-1, 1) → -0.30  error = 0.360
(T⁽¹⁾, T⁽²⁾) = ( 0,-1) → -0.10  error = 0.160
(T⁽¹⁾, T⁽²⁾) = ( 0, 0) →  0.00  error = 0.090
(T⁽¹⁾, T⁽²⁾) = ( 0, 1) →  0.10  error = 0.040
(T⁽¹⁾, T⁽²⁾) = ( 1,-1) →  0.30  error = 0.000 ← winner
(T⁽¹⁾, T⁽²⁾) = ( 1, 0) →  0.40  error = 0.010
(T⁽¹⁾, T⁽²⁾) = ( 1, 1) →  0.50  error = 0.040

The winner is $(1, -1)$. The first plane adds $0.4$, the second subtracts $0.1$, giving exactly $0.3$. This is why two planes matter — the same scales can reconstruct both $0.5$ and $0.3$ by choosing different trit combinations.

Same process for the remaining entries.
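
In code, the nine-way search is a few lines. A sketch (again, the names are mine):

```python
import numpy as np
from itertools import product

def best_trits(w, a1, a2):
    """Problem B for one entry: try all 9 (t1, t2) pairs, keep the least-squared-error one."""
    pairs = list(product([-1, 0, 1], repeat=2))
    errors = [(w - (a1 * t1 + a2 * t2)) ** 2 for t1, t2 in pairs]
    return pairs[int(np.argmin(errors))]

print(best_trits(0.5, 0.4, 0.1))   # (1, 1)
print(best_trits(0.3, 0.4, 0.1))   # (1, -1)
```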

convergence

We ran step 2 (find scales given trits) and step 3 (find trits given scales). Both steps agree. The scales we computed are optimal for these trits, and the trits we computed are optimal for these scales. Nothing changes if we iterate again. The algorithm has converged.

Our tiny example converged in one round because the matrix was small and the decomposition happened to be exact. In practice, with thousands of weights, the reconstruction error won’t reach zero — the algorithm iterates until the error drops below a tolerance (e.g. $\epsilon = 10^{-4}$), typically around 30 iterations, capped at 50.

Applying the same process to row 2 (I’ll spare you the nine-combo tables), we get:

$$T^{(1)} = \begin{bmatrix} 1 & -1 & 1 & -1 \\ 1 & 1 & -1 & -1 \end{bmatrix}, \quad T^{(2)} = \begin{bmatrix} 1 & -1 & -1 & 1 \\ 1 & -1 & -1 & 1 \end{bmatrix}$$

$$\alpha^{(1)} = \begin{bmatrix} 0.4 \\ 0.5 \end{bmatrix}, \quad \alpha^{(2)} = \begin{bmatrix} 0.1 \\ 0.3 \end{bmatrix}$$

Let’s verify by reconstructing every entry.

Row 1: $\alpha^{(1)} = 0.4$, $\alpha^{(2)} = 0.1$

Ŵ₁₁ = 0.4 × ( 1) + 0.1 × ( 1) =  0.4 + 0.1 =  0.5 ✓
Ŵ₁₂ = 0.4 × (-1) + 0.1 × (-1) = -0.4 - 0.1 = -0.5 ✓
Ŵ₁₃ = 0.4 × ( 1) + 0.1 × (-1) =  0.4 - 0.1 =  0.3 ✓
Ŵ₁₄ = 0.4 × (-1) + 0.1 × ( 1) = -0.4 + 0.1 = -0.3 ✓

Row 2: $\alpha^{(1)} = 0.5$, $\alpha^{(2)} = 0.3$

Ŵ₂₁ = 0.5 × ( 1) + 0.3 × ( 1) =  0.5 + 0.3 =  0.8 ✓
Ŵ₂₂ = 0.5 × ( 1) + 0.3 × (-1) =  0.5 - 0.3 =  0.2 ✓
Ŵ₂₃ = 0.5 × (-1) + 0.3 × (-1) = -0.5 - 0.3 = -0.8 ✓
Ŵ₂₄ = 0.5 × (-1) + 0.3 × ( 1) = -0.5 + 0.3 = -0.2 ✓

Perfect reconstruction. Every entry matches.
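
Putting the three steps together, here’s a compact end-to-end sketch of the alternating loop (structure and names are my own; I also seed the second plane from the sign of what the first plane misses, a small variation on step 1 that keeps the two planes from starting out identical, so the paper’s exact initialization and tie-breaking may differ):

```python
import numpy as np
from itertools import product

TRIT_PAIRS = np.array(list(product([-1, 0, 1], repeat=2)), dtype=float)   # the 9 (t1, t2) combos

def ptqtp_decompose(W, max_iter=50, tol=1e-4, lam=1e-4):
    """Alternate Problem A (scales given trits) and Problem B (trits given scales)."""
    a1 = np.abs(W).mean(axis=1)              # coarse scale per row
    T1 = np.sign(W)                          # coarse plane: sign of the weights
    R = W - a1[:, None] * T1                 # what the coarse plane misses
    a2 = np.abs(R).mean(axis=1)              # fine scale per row
    T2 = np.sign(R)                          # fine plane: sign of the residual
    prev = np.inf
    for _ in range(max_iter):
        # Problem A: per-row ridge least squares for (alpha1, alpha2).
        for i in range(W.shape[0]):
            X = np.stack([T1[i], T2[i]], axis=1)
            a1[i], a2[i] = np.linalg.solve(X.T @ X + lam * np.eye(2), X.T @ W[i])
        # Problem B: brute-force the best of the 9 trit pairs for every entry at once.
        recon = a1[:, None, None] * TRIT_PAIRS[:, 0] + a2[:, None, None] * TRIT_PAIRS[:, 1]
        best = np.argmin((W[:, :, None] - recon) ** 2, axis=-1)
        T1, T2 = TRIT_PAIRS[best, 0], TRIT_PAIRS[best, 1]
        err = np.mean((W - (a1[:, None] * T1 + a2[:, None] * T2)) ** 2)
        if prev - err < tol:
            break
        prev = err
    return T1, T2, a1, a2

W = np.array([[0.5, -0.5, 0.3, -0.3],
              [0.8,  0.2, -0.8, -0.2]])
T1, T2, a1, a2 = ptqtp_decompose(W)
print(np.round(a1, 3), np.round(a2, 3))                    # [0.4 0.5] [0.1 0.3]
print(np.round(a1[:, None] * T1 + a2[:, None] * T2, 3))    # recovers W
```

On the toy matrix this lands on essentially the same decomposition as the hand calculation (up to the tiny $\lambda$ wiggle in the scales); on a real layer it runs per group of columns, as described further down.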

why not just one plane?

You might wonder: why two trit-planes? Why not just one?

Let’s try. With a single trit-plane, the best we can do for row 1 is:

$$T = \begin{bmatrix} 1 & -1 & 1 & -1 \end{bmatrix}, \quad \alpha = \frac{|0.5| + |{-0.5}| + |0.3| + |{-0.3}|}{4} = 0.4$$

Reconstruction:

Ŵ₁₁ = 0.4 × ( 1) =  0.4    (want  0.5, error = 0.1)
Ŵ₁₂ = 0.4 × (-1) = -0.4    (want -0.5, error = 0.1)
Ŵ₁₃ = 0.4 × ( 1) =  0.4    (want  0.3, error = 0.1)
Ŵ₁₄ = 0.4 × (-1) = -0.4    (want -0.3, error = 0.1)

One scale factor can’t capture two different magnitudes ($0.5$ and $0.3$). The second trit-plane fixes this. It adds $0.1$ where the weight is bigger and subtracts $0.1$ where it’s smaller. Coarse + fine = exact.
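
The same comparison in a few lines of numpy (a quick check, not anything from the paper):

```python
import numpy as np

w = np.array([0.5, -0.5, 0.3, -0.3])
one_plane = np.abs(w).mean() * np.sign(w)                                     # alpha = 0.4, sign trits
two_plane = 0.4 * np.array([1, -1, 1, -1]) + 0.1 * np.array([1, -1, -1, 1])   # the decomposition above

print(np.round(np.abs(w - one_plane), 6))   # [0.1 0.1 0.1 0.1]
print(np.round(np.abs(w - two_plane), 6))   # [0. 0. 0. 0.]
```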

multiply without multiplying

Here’s where it gets good. Let’s do an actual forward pass.

Input vector: $x = \begin{bmatrix} 1.0 & 2.0 & -1.0 & 0.5 \end{bmatrix}$

The normal way (FP16)

Multiply every weight by the corresponding input, then sum:

y₁ = 0.5 × 1.0 + (-0.5) × 2.0 + 0.3 × (-1.0) + (-0.3) × 0.5
   = 0.5 - 1.0 - 0.3 - 0.15
   = -0.95

That’s 4 multiplications and 3 additions.

The PTQTP way

First, compute $T^{(1)} \cdot x$ for row 1. Each trit just tells us whether to add, subtract, or skip:

T₁₁ =  1 → keep x₁  → +1.0
T₁₂ = -1 → negate x₂ → -2.0
T₁₃ =  1 → keep x₃  → -1.0
T₁₄ = -1 → negate x₄ → -0.5
                    sum = -2.5

Then $T^{(2)} \cdot x$ for row 1:

T₁₁ =  1 → keep x₁  → +1.0
T₁₂ = -1 → negate x₂ → -2.0
T₁₃ = -1 → negate x₃ → +1.0
T₁₄ =  1 → keep x₄  → +0.5
                    sum =  0.5

No multiplications so far — just additions and sign flips. Now we need to multiply by the scale factors:

$$y_1 = 0.4 \times (-2.5) + 0.1 \times 0.5 = -1.0 + 0.05 = -0.95 \checkmark$$

Same answer. But count the operations: the trit-vector dot products used zero multiplications — just additions and subtractions. The only real multiplications are the two scale factors at the end. Normal inference does $n$ multiplications per row — one per weight. PTQTP does $k$ multiplications per row — one per trit-plane. The rest are additions. Multiplications go from $O(n)$ to $O(k)$, where $k = 2$ and $n$ can be thousands.

Let’s do row 2 to be thorough:

T⁽¹⁾ · x: (+1.0) + (+2.0) + (+1.0) + (-0.5) = 3.5
T⁽²⁾ · x: (+1.0) + (-2.0) + (+1.0) + (+0.5) = 0.5
$$y_2 = 0.5 \times 3.5 + 0.3 \times 0.5 = 1.75 + 0.15 = 1.9$$

Verify with FP16: $0.8 \times 1.0 + 0.2 \times 2.0 + (-0.8) \times (-1.0) + (-0.2) \times 0.5 = 0.8 + 0.4 + 0.8 - 0.1 = 1.9$ $\checkmark$

For a row with $d = 4096$ entries, normal inference does 4096 multiplications. PTQTP does 2 multiplications (the scales) plus roughly 4096 additions per trit-plane. That’s why it’s 4.63x faster.
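
Here’s the same forward pass as a deliberately loop-heavy sketch (mine, not the paper’s kernel): the inner loop only adds, subtracts, or skips, and the only multiplies are the two scales per row. A real kernel would pack the trits and vectorize, but the operation count is the point.

```python
import numpy as np

def ptqtp_matvec(T1, T2, a1, a2, x):
    """y = (a1 . T1 + a2 . T2) @ x, without ever multiplying by a weight."""
    m, n = T1.shape
    y = np.zeros(m)
    for i in range(m):
        s1 = s2 = 0.0
        for j in range(n):
            # A trit of +1 / -1 / 0 means add / subtract / skip. No multiply fires here.
            if   T1[i, j] ==  1: s1 += x[j]
            elif T1[i, j] == -1: s1 -= x[j]
            if   T2[i, j] ==  1: s2 += x[j]
            elif T2[i, j] == -1: s2 -= x[j]
        y[i] = a1[i] * s1 + a2[i] * s2   # the only two real multiplications per row
    return y

T1 = np.array([[1, -1,  1, -1], [1,  1, -1, -1]])
T2 = np.array([[1, -1, -1,  1], [1, -1, -1,  1]])
a1, a2 = np.array([0.4, 0.5]), np.array([0.1, 0.3])
x = np.array([1.0, 2.0, -1.0, 0.5])

print(ptqtp_matvec(T1, T2, a1, a2, x))   # [-0.95  1.9]
```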

the zero trit: free denoising

The zero in $\{-1, 0, 1\}$ isn’t just another value. It’s a feature.

Small weights in neural networks are often noise. In binary quantization ($\{-1, +1\}$), every weight gets forced to plus or minus one. A weight that’s $0.02$ — basically noise — gets amplified to $1.0$. That destroys reasoning.

With trits, that $0.02$ weight can become zero. Say our scales are $\alpha^{(1)} = 0.4$ and $\alpha^{(2)} = 0.1$. What’s the best trit combination for a weight of $0.02$?

(T⁽¹⁾, T⁽²⁾) = (0, 0) → 0.4×0 + 0.1×0 = 0.00   error = 0.02²  = 0.0004
(T⁽¹⁾, T⁽²⁾) = (0, 1) → 0.4×0 + 0.1×1 = 0.10   error = 0.08²  = 0.0064
(T⁽¹⁾, T⁽²⁾) = (1,-1) → 0.4×1 + 0.1×-1 = 0.30   error = 0.28²  = 0.0784

The algorithm picks $(0, 0)$. That tiny weight gets killed. Free denoising.
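
Running the same nine-way search from step 3 on that noisy weight (a self-contained sketch, written with `min` this time):

```python
from itertools import product

def best_trits(w, a1, a2):
    # Same brute force as step 3: keep the pair with the smallest squared error.
    return min(product([-1, 0, 1], repeat=2),
               key=lambda t: (w - (a1 * t[0] + a2 * t[1])) ** 2)

print(best_trits(0.02, 0.4, 0.1))   # (0, 0): the noisy weight is zeroed out
print(best_trits(0.50, 0.4, 0.1))   # (1, 1): real signal is kept
```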

The paper shows this matters most for math and coding tasks — exactly the tasks where noise hurts the most. Binary methods like PBLLM and BiLLM score 0% on Math-500. PTQTP scores 82.4%.

in practice

Our example used a tiny $2 \times 4$ matrix. Real models have weight matrices with thousands of rows and columns. A few things change at scale.

The weight matrix gets divided into groups of 128 columns. Each group gets its own pair of scale factors. More groups means more scales to store, but 2 FP16 numbers per 128 weights is negligible overhead — and the approximation gets much better.
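
A sketch of what that grouping could look like on top of the row-wise decomposition above (the reshape scheme and the 2-bits-per-trit packing below are my assumptions for illustration; the paper’s exact bookkeeping may differ):

```python
import numpy as np

def to_groups(W, group_size=128):
    """Treat every 128-column slice of a row as its own unit with its own scale pair."""
    m, n = W.shape
    assert n % group_size == 0, "pad or pick a group size that divides the width"
    # Shape (m * n/group_size, group_size): run the per-row decomposition on this,
    # and every group ends up with its own (alpha1, alpha2).
    return W.reshape(m, n // group_size, group_size).reshape(-1, group_size)

# Overhead check, per group of 128 weights: two trit-planes at 2 bits per trit
# (naive packing) plus two FP16 scales.
bits_per_weight = (2 * 128 * 2 + 2 * 16) / 128
print(bits_per_weight)   # 4.25
```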

The algorithm converges within about 30 iterations. The paper caps it at 50 with tolerance $\epsilon = 10^{-4}$, but in practice the scales stop changing well before that.

The whole quantization process — all layers, all weights — runs in about an hour on a single GPU. Compare that to BitNet 1.58b which needs 10-14 GPU-days of full retraining. No retraining, no fine-tuning, no architecture changes. Just decompose and go.

the results

Tested on LLaMA 3.x and Qwen3, from 0.6B to 70B parameters:

  • Matches or rivals BitNet 1.58b quality (which requires full retraining)
  • 4.63x faster inference than FP16 on an RTX 3090
  • ~1 hour of quantization vs 10-14 GPU-days for QAT
  • No architecture changes needed — drop-in replacement
  • 95% average accuracy retention on language reasoning tasks
  • Math and coding capabilities preserved where binary methods completely collapse

The storage cost is about 4 bits per weight (2 trit-planes x 2 bits each, plus scale overhead). More than BitNet’s 1.58 bits, but PTQTP doesn’t require retraining — and that’s the whole point. You take any pretrained model, run the algorithm for an hour, and get multiplication-free inference.

