From Floats to Integers: Quantizing a Neural Network Without Losing Accuracy
Taking an MNIST classifier from float32 down to 8-bit integers without losing accuracy: the scale-and-zero-point trick, and how a dot product survives the move to integers.
Post-Training Quantization to Trit-Planes for Large Language Models
Understanding how trit-plane quantization compresses LLMs without retraining.