Pramod.AI
Deep Dive

Small Language Models

Big intelligence in small packages - AI that runs anywhere

Not every AI task needs a 175-billion parameter model. Small Language Models (SLMs) - typically 1B to 13B parameters - deliver surprisingly strong performance at a fraction of the cost, latency, and energy. They can run on phones, laptops, and edge devices.

The secret: knowledge distillation (learning from larger models), better training data curation, and architectural innovations like Mixture of Experts. A well-trained 7B model today outperforms GPT-3 (175B) from 2020.
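The Mixture of Experts idea mentioned above can be sketched as a router that scores a set of expert subnetworks and runs only the top-k of them per input. A toy pure-Python illustration (the expert functions and router scores are made up for demonstration):

```python
import math

def softmax(xs):
    """Turn raw router scores into a probability distribution."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def moe_forward(x, experts, router_scores, top_k=2):
    """Route input x to the top-k experts; only those run (sparse activation)."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    return sum(probs[i] / norm * experts[i](x) for i in chosen)

# Four toy 'experts'; with top_k=2, only the two highest-scoring ones execute.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
out = moe_forward(3.0, experts, router_scores=[0.1, 2.0, 1.5, -1.0], top_k=2)
```

In a real MoE transformer the "experts" are feed-forward blocks and the router is a learned layer, but the compute saving is the same: most parameters sit idle on any given token.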

[Figure: Small language model pipeline - a large model (70B params) is distilled into a small model (7B params), then quantized from FP16 (14GB) to INT8 (7GB) or INT4 (4GB) for on-device use. Capability comparison: Phi-3 (3.8B, 2024) reaches ~92% of GPT-3's (175B, 2020) capability with a 46x smaller model through distillation + quantization.]

How It Works

1

Knowledge Distillation

A large 'teacher' model generates training data, and the smaller 'student' model learns to mimic its outputs. The student captures 80-90% of the teacher's capability at roughly a tenth of the cost.
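The standard way to implement "mimic the teacher's outputs" is a KL-divergence loss between temperature-softened output distributions. A minimal sketch (the function names, example logits, and temperature value are illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's distribution to the student's.

    A temperature > 1 softens both distributions, so the student also
    learns the teacher's relative preferences among the wrong answers
    ('dark knowledge'), not just its top pick.
    """
    p = softmax(teacher_logits, temperature)  # teacher (target)
    q = softmax(student_logits, temperature)  # student (prediction)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]   # confident teacher over 3 classes
student = [2.5, 1.5, 0.5]   # student not yet matching
loss = distillation_loss(teacher, student)  # > 0; zero when distributions match
```

During training the student minimizes this loss over the teacher's outputs, usually blended with the ordinary next-token loss on the ground-truth data.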

2

Data Curation

Training on fewer but higher-quality tokens. Microsoft's Phi proved that textbook-quality data can train a 2.7B model that rivals models 25x its size.
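Production curation pipelines score documents with classifier models trained on curated seed sets; the spirit can be shown with crude hand-written heuristics. Everything here (thresholds, scoring rules) is a hypothetical stand-in, not how Phi's pipeline actually works:

```python
def quality_score(text):
    """Toy 'textbook-likeness' score in [0, 1] from two crude heuristics."""
    words = text.split()
    if len(words) < 5:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    # Fraction of characters that are letters or spaces (penalizes spam/symbols).
    clean_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    score = 0.0
    if 3.0 <= avg_word_len <= 8.0:
        score += 0.5
    if clean_ratio > 0.85:
        score += 0.5
    return score

docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "CLICK HERE!!! $$$ win now $$$ http://spam",
]
curated = [d for d in docs if quality_score(d) >= 1.0]  # keeps only the first doc
```

The point is the direction of the trade: drop tokens aggressively so every token the small model does see is worth learning from.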

3

Quantization

Reducing numerical precision from 16-bit to 8-bit or 4-bit (formats like GGUF, GPTQ, AWQ). A 7B model shrinks from 14GB to about 4GB - small enough to fit in phone memory.
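The memory arithmetic and the core trick are both simple. Below is a sketch of symmetric per-tensor INT8 quantization (real formats like GGUF quantize per-block and carry extra metadata, so treat this as a simplified illustration):

```python
def quantize_int8(weights):
    """Symmetric quantization: store int8 values plus one float scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.42, -1.30, 0.07, 0.95, -0.51]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is at most scale/2 - tiny relative to the range.

# Memory arithmetic for a 7B-parameter model:
params = 7e9
print(params * 2 / 1e9)    # FP16: 2 bytes/param -> 14.0 GB
print(params * 1 / 1e9)    # INT8: 1 byte/param  ->  7.0 GB
print(params * 0.5 / 1e9)  # INT4: 0.5 byte/param -> 3.5 GB (plus scales -> ~4 GB)
```

Each weight is replaced by an 8-bit integer, and one shared scale factor recovers an approximation of the original value on the fly.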

4

Architecture Optimization

Grouped Query Attention, SwiGLU, RoPE - architectural choices that maximize performance per parameter. Mixture of Experts activates only relevant subnetworks.
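Grouped Query Attention's benefit is easiest to see in KV-cache memory at inference time: fewer key/value heads means a smaller cache per token. A back-of-envelope calculation with Llama-7B-like dimensions (the specific numbers are illustrative):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    """KV cache size: 2 tensors (K and V) per layer, per position, in FP16."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

# 32 layers, head_dim 128, 4096-token context:
mha = kv_cache_bytes(32, 32, 128, 4096)  # full multi-head attention: 32 KV heads
gqa = kv_cache_bytes(32, 8, 128, 4096)   # grouped-query attention: 8 KV heads

print(mha / 2**30)  # 2.0 GiB
print(gqa / 2**30)  # 0.5 GiB - a 4x smaller cache for the same context length
```

On a memory-constrained phone or laptop, that 4x reduction in cache size directly translates into longer contexts or bigger batch sizes.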

5

On-Device Deployment

Frameworks like llama.cpp, MLX (Apple), ONNX Runtime, and MediaPipe run models directly on consumer hardware - no cloud needed.

Key Components

Phi (Microsoft)

2.7B-14B params, trained on 'textbook quality' data, punches above its weight

Gemma (Google)

2B-27B open models, optimized for on-device, strong safety training

Llama 3.2 (Meta)

1B-3B lightweight models for mobile and edge deployment

Mistral 7B

The model that proved an open-source 7B can compete with proprietary 30B+ models

Qwen2.5 (Alibaba)

0.5B-72B range, strong multilingual, excellent at math and code

Apple Intelligence

On-device models for summarization, writing, Siri - privacy by design

Who's Building With This

A

Apple

On-device AI for iOS/macOS - summarize, rewrite, Siri, all private

M

Microsoft

Phi models prove small can be mighty. Powers Copilot on-device features.

O

Ollama

Run any open model locally with one command. 1M+ developers.

M

MediaTek/Qualcomm

NPUs in mobile chips - dedicated silicon for on-device AI inference

Key Takeaway

The future isn't just bigger models - it's the right-sized model for each task. SLMs enable AI everywhere: offline, private, instant, and cheap. The 7B model on your phone today is smarter than the 175B model of 2020.

References & Further Reading

  1. Phi-3 Technical Report
  2. Llama.cpp
  3. GGUF Format Specification
  4. Apple Intelligence Foundation Models
  5. Knowledge Distillation Survey
