LLM : Model Merging/Model Fusion
What is model merging?
A technique that combines the parameters (weights) of two or more pre-trained or fine-tuned language models, with the same or different architectures, into a single model that inherits (blends) the capabilities and knowledge of each parent model without retraining from scratch. Example: Model 1 does text summarization and Model 2 does text translation; merging M1 + M2 gives a single model covering both tasks, built by combining several task-specific models without any training or fine-tuning, just by performing mathematical operations on the model parameters.
Why model merging?
- Pre-training a large model from scratch or post-training a foundation model is:
    - Expensive
    - Time-consuming
    - Compute-heavy
- Model merging provides a cheap, fast alternative that can:
    - Merge several task-specific fine-tuned models into a single one.
    - Yield a single unified model capable of handling multiple tasks.
    - Require no additional gradient-based training or fine-tuning.
    - Run everything on CPU; no GPU required.
How does model merging work?
- If the models share the same architecture, we can merge their parameters by interpolating between each model's parameters in a way that preserves the geometry and properties of the embedding space.
- It algebraically mixes checkpoints—often as full weights or fine-tuning deltas—so the merged model inherits behaviors from multiple specialized models without re-training on raw data.
- It is done offline via weight-space operations like linear averaging, spherical interpolation, or structured selection, and can run entirely on CPU with libraries such as mergekit, as sketched below.
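A minimal sketch of the simplest weight-space operation, linearly interpolating two same-architecture checkpoints on CPU; the model paths and the `alpha` value are placeholders, and this is not the mergekit implementation:

```python
import torch
from transformers import AutoModelForCausalLM

alpha = 0.5  # interpolation weight for model A (placeholder value)

# Placeholder paths: any two fine-tunes of the same base architecture.
model_a = AutoModelForCausalLM.from_pretrained("path/to/finetune-a", torch_dtype=torch.float32)
model_b = AutoModelForCausalLM.from_pretrained("path/to/finetune-b", torch_dtype=torch.float32)

state_a, state_b = model_a.state_dict(), model_b.state_dict()

# theta_merged = alpha * theta_a + (1 - alpha) * theta_b, parameter by parameter.
merged = {}
for name, tensor_a in state_a.items():
    tensor_b = state_b[name]
    if tensor_a.is_floating_point():
        merged[name] = alpha * tensor_a + (1 - alpha) * tensor_b
    else:
        merged[name] = tensor_a  # integer buffers etc.: keep model A's copy

model_a.load_state_dict(merged)           # reuse model A as the container
model_a.save_pretrained("merged-model")   # no gradient step was taken
```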
Why does model merging work?
- Fine-tuned models that start from a shared initialization often lie in a connected low-loss basin, so averaging their weights can improve accuracy and robustness, an effect popularized as “model soups”.
- The main failure mode is parameter interference (e.g., redundant updates and sign conflicts across models), which specialized schemes like TIES address by trimming small deltas and resolving sign disagreements before merging.
- Thus, well-designed merging can generalize better than naive averages and provide stronger starting points for subsequent fine-tuning.
What are the most common merging methods?
- Model soups (linear averaging)
- Spherical Linear Interpolation (SLERP)
- Task Arithmetic
- Trim, Elect Sign & Merge (TIES)
- Drop And REscale (DARE)
- Franken-merging
Linear/Soup Averaging :
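Linear or “soup” averaging takes the element-wise mean of several fine-tuned checkpoints that share an architecture and a common initialization. A minimal sketch, assuming each soup member is stored as a plain PyTorch state_dict and the file names are placeholders:

```python
import torch

member_paths = ["soup_member_0.pt", "soup_member_1.pt", "soup_member_2.pt"]  # placeholders

soup = None
for path in member_paths:
    state = torch.load(path, map_location="cpu")  # each file holds one member's state_dict
    if soup is None:
        soup = {k: v.clone().float() for k, v in state.items()}
    else:
        for k in soup:
            soup[k] += state[k].float()

# Uniform soup: element-wise mean over all members.
soup = {k: v / len(member_paths) for k, v in soup.items()}
torch.save(soup, "model_soup.pt")
```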
Spherical Linear Interpolation (SLERP)
Motivation : in computer graphics, SLERP was introduced to find smooth transitions between two camera rotations.
What is SLERP?
SLERP interpolates between two weight vectors along the arc of a sphere rather than along a straight line.
It only works with two models at a time, and the interpolation factor lets us favour one of them.
It helps preserve the magnitude of the weights and the curvature of the embedding space.
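A minimal per-tensor SLERP sketch (mergekit applies something similar tensor by tensor, but this is not its exact code); the helper name, the `eps`, and the colinearity threshold are my own choices:

```python
import torch

def slerp(t: float, v0: torch.Tensor, v1: torch.Tensor, eps: float = 1e-8) -> torch.Tensor:
    # Spherically interpolate between two weight tensors; t=0 -> v0, t=1 -> v1.
    a, b = v0.flatten().float(), v1.flatten().float()
    a_dir = a / (a.norm() + eps)
    b_dir = b / (b.norm() + eps)
    dot = torch.clamp(torch.dot(a_dir, b_dir), -1.0, 1.0)

    if dot.abs() > 0.9995:                 # nearly colinear: plain lerp is numerically safer
        merged = (1 - t) * a + t * b
    else:
        omega = torch.arccos(dot)          # angle between the two weight vectors
        sin_omega = torch.sin(omega)
        merged = (torch.sin((1 - t) * omega) * a + torch.sin(t * omega) * b) / sin_omega
    return merged.reshape(v0.shape).to(v0.dtype)

# merged_state = {k: slerp(0.3, sd_a[k], sd_b[k]) for k in sd_a}  # t=0.3 favours model A
```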
TIES : TRIM, ELECT SIGN & MERGE
Parameter interference can degrade merged performance. TIES mitigates it in three steps: trim small-magnitude fine-tuning deltas, elect a per-parameter sign by total magnitude, and merge only the deltas that agree with the elected sign.
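A minimal sketch of those three steps on fine-tuning deltas (“task vectors”); `base_sd` and `finetuned_sds` are assumed to be state_dicts with identical keys and shapes, and `density`/`lam` are hypothetical hyper-parameter names, not the paper's official API:

```python
import torch

def ties_merge(base_sd, finetuned_sds, density=0.2, lam=1.0):
    merged = {}
    for name, base in base_sd.items():
        base = base.float()
        # Task vectors: how far each fine-tuned model moved away from the base.
        deltas = torch.stack([sd[name].float() - base for sd in finetuned_sds])

        # 1) TRIM: keep only the top-`density` fraction of each delta by magnitude.
        k = max(1, int(density * deltas[0].numel()))
        for d in deltas:                                   # views into `deltas`, edited in place
            threshold = torch.topk(d.abs().flatten(), k).values.min()
            d[d.abs() < threshold] = 0.0

        # 2) ELECT SIGN: per parameter, pick the sign with the larger total magnitude.
        elected = torch.sign(deltas.sum(dim=0))

        # 3) MERGE: disjoint mean of the deltas that agree with the elected sign.
        agree = (torch.sign(deltas) == elected) & (deltas != 0)
        summed = (deltas * agree).sum(dim=0)
        count = agree.sum(dim=0).clamp(min=1)
        merged[name] = base + lam * summed / count
    return merged
```

Loading `merged` back into a model with the base architecture gives the final multi-task checkpoint.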
References
- LLM Marathon Series Lecture : Model Merging
- Deep Dive : Model Merging - Part-1
- Deep Dive : Model Merging - Part-2
- https://huggingface.co/blog/mlabonne/merge-models
- https://github.com/arcee-ai/mergekit
- https://www.slideshare.net/slideshow/julien-simon-deep-dive-model-merging/270921708