LLMs : Model Compression - Intro Part-0⚓︎
One quote is quite famous in large-scale modeling - "Bigger is Better", meaning that more data, more parameters, and more compute yield a better model. However, bigger models also drive up GPU memory needs, latency, inference cost, and environmental impact, so the gains come with significant operational and sustainability tradeoffs.

Quest-1 : Why does model performance scale ?
Neural scaling laws state that LLM loss follows a power law in model size, dataset size, and training compute:

\[
P(N) \propto N^{-\alpha_N}, \qquad P(D) \propto D^{-\alpha_D}, \qquad P(C) \propto C^{-\alpha_C}
\]

where \(P\) is the model's performance (test loss), \(N\) is the model size (number of parameters), \(D\) is the dataset size, \(C\) is the training compute, and the \(\alpha\)'s are empirically fitted exponents ([2001.08361]).
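To make the power law concrete, here is a minimal Python sketch that evaluates the single-variable forms \(P(N) = (N_c/N)^{\alpha_N}\) and \(P(D) = (D_c/D)^{\alpha_D}\) with approximate constants reported in [2001.08361]; treat the exact numbers as illustrative only, since they depend on the tokenizer and training setup.

```python
# Approximate fitted constants from Kaplan et al. [2001.08361]; illustrative, not exact.
ALPHA_N, N_C = 0.076, 8.8e13   # parameter-count exponent and scale
ALPHA_D, D_C = 0.095, 5.4e13   # dataset-size exponent and scale

def loss_from_params(n_params: float) -> float:
    """Predicted test loss when only model size N is the bottleneck."""
    return (N_C / n_params) ** ALPHA_N

def loss_from_data(n_tokens: float) -> float:
    """Predicted test loss when only dataset size D is the bottleneck."""
    return (D_C / n_tokens) ** ALPHA_D

if __name__ == "__main__":
    for n in (1e9, 1e10, 1e11, 1e12):  # 1B -> 1T parameters
        print(f"N={n:.0e}  predicted loss ~ {loss_from_params(n):.3f}")
```

Each 10x increase in parameters shaves off a progressively smaller slice of loss, which is exactly the diminishing-returns shape the power law implies.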
"Bigger is better" holds when data and compute are sufficient, but for a fixed compute budget, compute-optimal scaling ([2203.15556]) shows that growing data and parameters in a balanced way yields the best efficiency, so a well-trained smaller model can outperform a larger, undertrained one. Several factors also work against the "bigger is better" quote once a model is deployed for serving:
- GPU memory requirements
- Latency
- Inference Costs
- Environmental Concerns
Thus, large LLMs become too costly for businesses to serve at inference time. The question, then, is how we can accelerate inference and make serving affordable. A rough back-of-the-envelope estimate of the GPU memory requirement is sketched below.
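To see why the GPU-memory factor alone is a problem, here is a minimal sketch that estimates the serving footprint from the parameter count, the weight precision, and a simple KV-cache term. The layer count, hidden size, and the "7B model" example are illustrative assumptions, and the formula ignores activations, GQA, and framework overhead.

```python
def serving_memory_gb(
    n_params: float,            # number of parameters
    bytes_per_param: float = 2, # fp16/bf16 weights
    n_layers: int = 32,
    hidden_size: int = 4096,
    context_len: int = 4096,
    batch_size: int = 8,
    kv_bytes: int = 2,          # fp16 KV cache
) -> float:
    """Rough estimate: weights + KV cache (ignores activations and overhead)."""
    weights = n_params * bytes_per_param
    # KV cache: 2 tensors (K and V) per layer, each of shape [batch, context, hidden]
    kv_cache = 2 * n_layers * batch_size * context_len * hidden_size * kv_bytes
    return (weights + kv_cache) / 1e9

# A hypothetical 7B-parameter model served in fp16:
print(f"~{serving_memory_gb(7e9):.1f} GB")
# The same model with 4-bit weights (0.5 bytes per parameter):
print(f"~{serving_memory_gb(7e9, bytes_per_param=0.5):.1f} GB")
```

Even a "small" 7B model in fp16 with a modest batch already pushes past a single 24 GB consumer GPU, which is why the memory, latency, and cost factors above bite so quickly at scale.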
Ex-1 : Why is gpt-4o-mini so cheap when compared to gpt-4o ?
- It is a fast, affordable small model designed for focused tasks, and it can be trained via distillation on outputs from a larger model like GPT‑4o to retain quality at lower cost and latency.
- The per‑token rates are far lower (e.g., GPT‑4o mini at $0.15 / $0.60 per 1M input/output tokens vs GPT‑4o at $2.50 / $10.00), directly reflecting reduced inference cost; the arithmetic is worked out in the sketch after this list.
- Larger‑model outputs can be distilled into GPT‑4o mini and fine‑tuned for domain tasks, retaining accuracy while running at the mini model's lower price point.
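As a quick worked example, here is a minimal sketch comparing the monthly bill for a hypothetical workload at the rates quoted above; the traffic numbers are assumptions, the per-token prices come from the bullet list.

```python
# Prices per 1M tokens (from the comparison above)
PRICES = {
    "gpt-4o":      {"input": 2.50, "output": 10.00},
    "gpt-4o-mini": {"input": 0.15, "output": 0.60},
}

def monthly_cost(model: str, input_tokens: float, output_tokens: float) -> float:
    """Total monthly cost in dollars for the given token volumes."""
    p = PRICES[model]
    return (input_tokens * p["input"] + output_tokens * p["output"]) / 1e6

# Hypothetical workload: 2B input tokens and 500M output tokens per month
for model in PRICES:
    print(f"{model}: ${monthly_cost(model, 2e9, 5e8):,.0f} / month")
# gpt-4o:      $10,000 / month
# gpt-4o-mini: $600 / month
```

At these rates the same workload is roughly 16x cheaper on the mini model, which is the business case for serving smaller, distilled models.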
How can we deploy LLMs in a cost-effective manner while maintaining high performance?
LLMs can be deployed cost‑effectively while maintaining high performance by combining lossy model compression (quantization, pruning, distillation) with lossless engineering (batching, KV‑cache optimization, optimized kernels, speculative decoding) to cut compute, memory, and latency with minimal quality impact. Together, these approaches reduce per‑request cost while increasing throughput on existing hardware. As I said, for cost-effective inferencing there are two broad approaches:
- Lossless : Efficient Engineering
- Lossy : Model Compression
Efficient Engineering
We try to maximize hardware utilization with continuous batching, optimized attention kernels, and careful scheduling to raise tokens‑per‑second per GPU without altering model quality. We also manage KV caches with reuse and memory‑aware techniques to extend context length and concurrency without exhausting VRAM, and we use decoding accelerators such as speculative decoding, where a small draft model proposes tokens that the target model verifies, achieving multi‑x speedups while preserving output fidelity. A toy sketch of the speculative-decoding idea follows.
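The following is a toy, self-contained sketch of the greedy-verification variant of speculative decoding. The `draft_model` and `target_model` callables are stand-ins (assumptions, not a real library API), and production systems verify all proposed tokens in one batched target pass and use probabilistic acceptance rather than exact matching.

```python
from typing import Callable, List

Token = int
Model = Callable[[List[Token]], Token]  # returns the greedy next token for a prefix

def speculative_decode(
    draft_model: Model,
    target_model: Model,
    prompt: List[Token],
    max_new_tokens: int = 32,
    k: int = 4,  # tokens proposed by the draft model per round
) -> List[Token]:
    """Greedy speculative decoding: the cheap draft proposes k tokens,
    the expensive target verifies them; accept up to the first mismatch."""
    tokens = list(prompt)
    while len(tokens) < len(prompt) + max_new_tokens:
        # 1) Draft proposes k tokens autoregressively (cheap).
        proposal, ctx = [], list(tokens)
        for _ in range(k):
            t = draft_model(ctx)
            proposal.append(t)
            ctx.append(t)
        # 2) Target verifies each proposed token (conceptually one batched pass).
        accepted, correction = 0, None
        for i, t in enumerate(proposal):
            expected = target_model(tokens + proposal[:i])
            if t == expected:
                accepted += 1
            else:
                correction = expected  # take the target's token instead
                break
        tokens.extend(proposal[:accepted])
        if correction is not None:
            tokens.append(correction)
        else:
            # All proposals accepted: the target still yields one bonus token.
            tokens.append(target_model(tokens))
    return tokens[: len(prompt) + max_new_tokens]

if __name__ == "__main__":
    # Dummy "models": both count upward, but the draft disagrees after token 7.
    target = lambda ctx: (ctx[-1] + 1) % 100
    draft = lambda ctx: 0 if ctx[-1] == 7 else (ctx[-1] + 1) % 100
    print(speculative_decode(draft, target, prompt=[0], max_new_tokens=12))
```

Because only tokens that match the target's own greedy choice are accepted, the output is identical to decoding with the target alone; the speedup comes from the target verifying several tokens per call instead of generating one at a time.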
Model Compression
Model compression is inspired by the opposite idea, "bigger is not better", pushing back against relying on the neural scaling law alone. It reduces a model's size and compute by approximating its parameters or behavior, trading a small amount of accuracy for large gains in speed, memory, and cost. In LLMs, the main lossy methods are quantization, pruning, and distillation, each altering the model in a different way to deliver efficiency improvements.
Lossy compression approximates the information in a large model, so the compressed model behaves slightly differently from the original while being much cheaper to run. There are three such families of techniques (a minimal quantization sketch follows the list):
- Quantization - It keeps the model architecture the same but reduces the number of bits used per parameter.
- Pruning - It removes parts of the model while retaining performance.
- Knowledge Distillation - It trains a smaller student model to imitate the bigger teacher model.
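As a preview of the next article, here is a minimal sketch of symmetric per-tensor int8 weight quantization in NumPy; the scale computation and the fake weight matrix are illustrative assumptions, not any particular library's scheme.

```python
import numpy as np

def quantize_int8(w: np.ndarray):
    """Symmetric per-tensor int8 quantization: w is approximated by scale * q."""
    scale = np.abs(w).max() / 127.0                       # map the largest |weight| to 127
    q = np.clip(np.round(w / scale), -127, 127).astype(np.int8)
    return q, scale

def dequantize(q: np.ndarray, scale: float) -> np.ndarray:
    return q.astype(np.float32) * scale

if __name__ == "__main__":
    rng = np.random.default_rng(0)
    w = rng.normal(size=(4096, 4096)).astype(np.float32)  # a fake weight matrix
    q, scale = quantize_int8(w)
    err = np.abs(w - dequantize(q, scale)).mean()
    print(f"fp32: {w.nbytes / 1e6:.0f} MB -> int8: {q.nbytes / 1e6:.0f} MB, "
          f"mean abs error {err:.4f}")
```

The storage drops 4x (fp32 to int8) while the reconstruction error stays small, which is the basic accuracy-for-efficiency trade that all lossy compression methods make.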
Conclusion
We have studied why we need model compression, what model compression is, and how large-scale models can be deployed to production cost-effectively while maintaining their performance. In the upcoming articles we will learn about the different model compression techniques discussed here, starting with the first one: quantization.
References⚓︎
- https://huggingface.co/blog/large-language-models
- [2203.15556] Training Compute-Optimal Large Language Models - https://arxiv.org/abs/2203.15556
- [2001.08361] Scaling Laws for Neural Language Models - https://arxiv.org/abs/2001.08361
- ELL881/AIL821 | LCS2-IITD