Test-Time Steering for Lossless Text Compression via Weighted Product of Experts

GitHub Repository | EMNLP 2025 Paper

When I was a child, I always wondered: if I use a compressor to compress a file over and over again, will the file get smaller and smaller until it vanishes? Of course, the answer is no. If we compress the compressed data with the same compressor again, we get a file of essentially the same size, sometimes even a slightly larger one.

Today I understand this is because of the fundamental limits of lossless compression established by information theory. But what about using multiple compressors together? If we combine several compressors on the same data, can each one remove a different part of the data's redundancy? And how should such a combination be designed?
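Before getting to that, here is a quick way to see the first point empirically. The sketch below is my own illustration and is not taken from the paper; it uses Python's standard `zlib` module to show that compressing already-compressed data does not make it meaningfully smaller.

```python
import zlib

# Highly redundant input: repeated English-like text.
data = b"the quick brown fox jumps over the lazy dog. " * 2000

once = zlib.compress(data, 9)    # first pass removes most of the redundancy
twice = zlib.compress(once, 9)   # second pass has (almost) nothing left to remove

print(f"original:         {len(data):>7} bytes")
print(f"compressed once:  {len(once):>7} bytes")
print(f"compressed twice: {len(twice):>7} bytes")  # ~same size, often slightly larger
```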

This is the question that our work Test-Time Steering for Lossless Text Compression via Weighted Product of Experts 1 aims to answer. Compared to the EMNLP paper, this blog post focuses more on the intuition behind our method, presenting it in a way that is easier to understand.

Why the Exponential? From Max‑Entropy RL to the Boltzmann Distribution

Modern RL, attention mechanisms, classification, energy-based modeling, and statistical mechanics keep arriving at the same exponential shape:

\[ p(x)\;\propto\;\exp(\text{logits or reward}(x)/T)\quad\text{or}\quad p(x)\;\propto\;\exp(-E(x)/T). \]

Why does the exponential keep showing up, and what does the "temperature" actually do?
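To make that shared shape concrete before unpacking it, here is a small numerical sketch (my own illustration, not code from the paper): plugging scores into \(\exp(\cdot/T)\) and normalizing gives a proper distribution, and \(T\) controls how sharp or flat that distribution is.

```python
import numpy as np

def boltzmann(scores, T=1.0):
    """p(x) ∝ exp(score(x) / T); pass negative energies -E(x) as scores."""
    z = np.asarray(scores, dtype=float) / T
    z -= z.max()                 # shift for numerical stability; does not change the result
    p = np.exp(z)
    return p / p.sum()

logits = [2.0, 1.0, 0.1]
for T in (0.5, 1.0, 5.0):
    print(f"T={T}: {boltzmann(logits, T).round(3)}")
# Low T concentrates mass on the highest-scoring option; high T approaches uniform.
```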