Pramod.AI
Deep Dive

Small Language Models

Big intelligence in small packages - AI that runs anywhere

Not every AI task needs a 175-billion parameter model. Small Language Models (SLMs) - typically 1B to 13B parameters - deliver surprisingly strong performance at a fraction of the cost, latency, and energy. They can run on phones, laptops, and edge devices.

The secret: knowledge distillation (learning from larger models), better training data curation, and architectural innovations like Mixture of Experts. A well-trained 7B model today outperforms GPT-3 (175B) from 2020.
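The Mixture of Experts idea mentioned above can be sketched as a router that scores a set of expert subnetworks and runs only the top-k of them per input. A toy pure-Python illustration (the expert functions and router scores are made up for demonstration):

```python
import math

def softmax(xs):
    """Turn raw router scores into a probability distribution."""
    m = max(xs)
    e = [math.exp(x - m) for x in xs]
    s = sum(e)
    return [v / s for v in e]

def moe_forward(x, experts, router_scores, top_k=2):
    """Route input x to the top-k experts; only those run (sparse activation)."""
    probs = softmax(router_scores)
    chosen = sorted(range(len(experts)), key=lambda i: probs[i], reverse=True)[:top_k]
    norm = sum(probs[i] for i in chosen)  # renormalize over the chosen experts
    return sum(probs[i] / norm * experts[i](x) for i in chosen)

# Four toy 'experts'; with top_k=2, only the two highest-scoring ones execute.
experts = [lambda x: x + 1, lambda x: 2 * x, lambda x: x ** 2, lambda x: -x]
out = moe_forward(3.0, experts, router_scores=[0.1, 2.0, 1.5, -1.0], top_k=2)
```

In a real MoE transformer the "experts" are feed-forward blocks and the router is a learned layer, but the compute saving is the same: most parameters sit idle on any given token.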

[Figure: Small language model pipeline - a large model (70B params) is distilled into a small model (7B params), then quantized from FP16 (14GB) to INT8 (7GB) or INT4 (4GB) for on-device use. Capability comparison: Phi-3 (3.8B, 2024) reaches ~92% of GPT-3's (175B, 2020) capability with a 46x smaller model through distillation + quantization.]

How It Works

1

Knowledge Distillation

A large 'teacher' model generates training data, and the smaller 'student' model learns to mimic its outputs. The student captures 80-90% of the teacher's capability at roughly a tenth of the cost.
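The standard way to implement "mimic the teacher's outputs" is a KL-divergence loss between temperature-softened output distributions. A minimal sketch (the function names, example logits, and temperature value are illustrative):

```python
import math

def softmax(logits, temperature=1.0):
    """Convert logits to probabilities, softened by a temperature."""
    scaled = [x / temperature for x in logits]
    m = max(scaled)
    exps = [math.exp(x - m) for x in scaled]
    total = sum(exps)
    return [e / total for e in exps]

def distillation_loss(teacher_logits, student_logits, temperature=2.0):
    """KL divergence from the teacher's distribution to the student's.

    A temperature > 1 softens both distributions, so the student also
    learns the teacher's relative preferences among the wrong answers
    ('dark knowledge'), not just its top pick.
    """
    p = softmax(teacher_logits, temperature)  # teacher (target)
    q = softmax(student_logits, temperature)  # student (prediction)
    return sum(pi * math.log(pi / qi) for pi, qi in zip(p, q))

teacher = [4.0, 1.0, 0.2]   # confident teacher over 3 classes
student = [2.5, 1.5, 0.5]   # student not yet matching
loss = distillation_loss(teacher, student)  # > 0; zero when distributions match
```

During training the student minimizes this loss over the teacher's outputs, usually blended with the ordinary next-token loss on the ground-truth data.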

2

Data Curation

Training on fewer but higher-quality tokens. Microsoft's Phi proved that textbook-quality data can train a 2.7B model that rivals models 25x its size.
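Production curation pipelines score documents with classifier models trained on curated seed sets; the spirit can be shown with crude hand-written heuristics. Everything here (thresholds, scoring rules) is a hypothetical stand-in, not how Phi's pipeline actually works:

```python
def quality_score(text):
    """Toy 'textbook-likeness' score in [0, 1] from two crude heuristics."""
    words = text.split()
    if len(words) < 5:
        return 0.0
    avg_word_len = sum(len(w) for w in words) / len(words)
    # Fraction of characters that are letters or spaces (penalizes spam/symbols).
    clean_ratio = sum(c.isalpha() or c.isspace() for c in text) / len(text)
    score = 0.0
    if 3.0 <= avg_word_len <= 8.0:
        score += 0.5
    if clean_ratio > 0.85:
        score += 0.5
    return score

docs = [
    "Photosynthesis converts light energy into chemical energy stored in glucose.",
    "CLICK HERE!!! $$$ win now $$$ http://spam",
]
curated = [d for d in docs if quality_score(d) >= 1.0]  # keeps only the first doc
```

The point is the direction of the trade: drop tokens aggressively so every token the small model does see is worth learning from.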

3

Quantization

Reducing numerical precision from 16-bit to 8-bit or 4-bit (formats like GGUF, GPTQ, AWQ). A 7B model shrinks from 14GB to about 4GB - small enough to fit in phone memory.
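The memory arithmetic and the core trick are both simple. Below is a sketch of symmetric per-tensor INT8 quantization (real formats like GGUF quantize per-block and carry extra metadata, so treat this as a simplified illustration):

```python
def quantize_int8(weights):
    """Symmetric quantization: store int8 values plus one float scale."""
    scale = max(abs(w) for w in weights) / 127.0
    q = [max(-127, min(127, round(w / scale))) for w in weights]
    return q, scale

def dequantize(q, scale):
    return [qi * scale for qi in q]

weights = [0.42, -1.30, 0.07, 0.95, -0.51]
q, scale = quantize_int8(weights)
restored = dequantize(q, scale)
# Rounding error per weight is at most scale/2 - tiny relative to the range.

# Memory arithmetic for a 7B-parameter model:
params = 7e9
print(params * 2 / 1e9)    # FP16: 2 bytes/param -> 14.0 GB
print(params * 1 / 1e9)    # INT8: 1 byte/param  ->  7.0 GB
print(params * 0.5 / 1e9)  # INT4: 0.5 byte/param -> 3.5 GB (plus scales -> ~4 GB)
```

Each weight is replaced by an 8-bit integer, and one shared scale factor recovers an approximation of the original value on the fly.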

4

Architecture Optimization

Grouped Query Attention, SwiGLU, RoPE - architectural choices that maximize performance per parameter. Mixture of Experts activates only relevant subnetworks.
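Grouped Query Attention's benefit is easiest to see in KV-cache memory at inference time: fewer key/value heads means a smaller cache per token. A back-of-envelope calculation with Llama-7B-like dimensions (the specific numbers are illustrative):

```python
def kv_cache_bytes(layers, kv_heads, head_dim, seq_len, bytes_per_val=2):
    """KV cache size: 2 tensors (K and V) per layer, per position, in FP16."""
    return 2 * layers * kv_heads * head_dim * seq_len * bytes_per_val

# 32 layers, head_dim 128, 4096-token context:
mha = kv_cache_bytes(32, 32, 128, 4096)  # full multi-head attention: 32 KV heads
gqa = kv_cache_bytes(32, 8, 128, 4096)   # grouped-query attention: 8 KV heads

print(mha / 2**30)  # 2.0 GiB
print(gqa / 2**30)  # 0.5 GiB - a 4x smaller cache for the same context length
```

On a memory-constrained phone or laptop, that 4x reduction in cache size directly translates into longer contexts or bigger batch sizes.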

5

On-Device Deployment

Frameworks like llama.cpp, MLX (Apple), ONNX Runtime, and MediaPipe run models directly on consumer hardware - no cloud needed.

Key Components

Phi (Microsoft)

2.7B-14B params, trained on 'textbook quality' data, punches above its weight

Gemma (Google)

2B-27B open models, optimized for on-device, strong safety training

Llama 3.2 (Meta)

1B-3B lightweight models for mobile and edge deployment

Mistral 7B

The model that proved an open-source 7B can compete with proprietary 30B+ models

Qwen2.5 (Alibaba)

0.5B-72B range, strong multilingual, excellent at math and code

Apple Intelligence

On-device models for summarization, writing, Siri - privacy by design

Who's Building With This

A

Apple

On-device AI for iOS/macOS - summarize, rewrite, Siri, all private

M

Microsoft

Phi models prove small can be mighty. Powers Copilot on-device features.

O

Ollama

Run any open model locally with one command. 1M+ developers.

M

MediaTek/Qualcomm

NPUs in mobile chips - dedicated silicon for on-device AI inference

Key Takeaway

The future isn't just bigger models - it's the right-sized model for each task. SLMs enable AI everywhere: offline, private, instant, and cheap. The 7B model on your phone today is smarter than the 175B model of 2020.

References & Further Reading

  1. Phi-3 Technical Report
  2. Llama.cpp
  3. GGUF Format Specification
  4. Apple Intelligence Foundation Models
  5. Knowledge Distillation Survey
