DeepSeek

Summary

DeepSeek is a Chinese AI research lab (founded July 2023, backed by the hedge fund High-Flyer) that develops large language models rivaling GPT-4 and Claude at a fraction of the cost. Its models are released under the MIT license and use a Mixture-of-Experts (MoE) architecture that activates only a subset of parameters per token, enabling dramatically cheaper training and inference. DeepSeek sent shockwaves through the industry by training DeepSeek-V3 for roughly $6M in compute (vs. an estimated $100M for GPT-4) while matching or exceeding frontier-model performance on coding and math benchmarks.

Content

Company Background

  • Founded: July 2023 by Liang Wenfeng (also CEO)
  • Headquarters: Hangzhou, Zhejiang, China
  • Funding: Owned and funded by High-Flyer, a Chinese quantitative hedge fund
  • License: MIT (open weights and code; research reports published openly)

Model Lineup

DeepSeek-V3

The core generalist model. 671B total parameters in a MoE architecture with 37B activated per token. Trained on 2.788M H800 GPU hours. Pioneers auxiliary-loss-free load balancing and multi-token prediction. Outperforms GPT-4.5 on coding and math benchmarks. Uses Multi-head Latent Attention (MLA) for efficient inference.

DeepSeek-R1

A dedicated reasoning model built on top of V3 via reinforcement learning. Achieves OpenAI o1-level performance on mathematical proofs, algorithmic logic, and formal reasoning. 671B parameters, 164K context length. Additional training cost ~$294K beyond the base model.

DeepSeek-V3.2 / V3.2-Speciale

Late 2025 releases targeting frontier-level reasoning. V3.2-Speciale achieved gold-medal performance in the 2025 International Mathematical Olympiad (IMO) and International Olympiad in Informatics (IOI). Matches or surpasses GPT-5 on high-end reasoning tasks.

DeepSeek-V4 (upcoming)

Expected mid-2026. Anticipated context windows exceeding 1M tokens, enabling full-codebase ingestion and repository-level reasoning.

Architecture Highlights

  • MoE (Mixture-of-Experts): Only 37B of 671B parameters activated per token — dramatically reduces FLOPs vs. dense models
  • Multi-head Latent Attention (MLA): Compresses KV cache for efficient long-context inference
  • Auxiliary-loss-free load balancing: Improves MoE routing stability without auxiliary training losses
  • Multi-token prediction: Strengthens reasoning and generation quality
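The MoE idea above can be illustrated with a toy sketch: a router scores every expert for each token, but only the top-k experts actually run, so compute scales with k rather than with the total expert count. This is a minimal illustration, not DeepSeek's actual router (which adds latent attention, shared experts, and auxiliary-loss-free balancing).

```python
import math

def softmax(xs):
    # Numerically stable softmax over the router's raw scores
    m = max(xs)
    es = [math.exp(x - m) for x in xs]
    s = sum(es)
    return [e / s for e in es]

def moe_forward(token, experts, router_logits, k=2):
    """Route one token to the top-k experts and mix their outputs.

    experts: list of callables (each standing in for an expert network)
    router_logits: one raw gating score per expert for this token
    """
    scores = softmax(router_logits)
    # Pick the k highest-scoring experts; all others stay inactive,
    # so FLOPs scale with k, not with the total number of experts.
    topk = sorted(range(len(experts)), key=lambda i: scores[i], reverse=True)[:k]
    norm = sum(scores[i] for i in topk)
    mixed = sum(scores[i] / norm * experts[i](token) for i in topk)
    return mixed, topk

# 8 toy experts; only 2 run per token (cf. 37B of 671B params active).
experts = [lambda x, s=s: s * x for s in range(1, 9)]
out, active = moe_forward(3.0, experts,
                          router_logits=[0.1, 2.0, 0.3, 1.5, 0.0, 0.2, 0.4, 0.1])
print(active)  # indices of the 2 activated experts
```

The same proportionality drives DeepSeek's numbers: activating ~5.5% of parameters per token is what makes 671B-parameter training and serving affordable.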

Cost Efficiency

  • Training V3: ~$6M total (vs. ~$100M for GPT-4)
  • API pricing: as low as $0.27/1M input tokens (cache miss) — 10–30x cheaper than comparable proprietary models
  • “Thinking in Tool-Use”: agent-style reasoning before each tool call, with self-correction when tool outputs are inconsistent — fewer wasted calls keeps agent workloads cheap
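The pricing gap is easy to quantify. A back-of-envelope sketch, using the $0.27/1M cache-miss input rate quoted above and a hypothetical $5/1M rate standing in for a comparable proprietary model (the latter is an illustrative placeholder, not a quote):

```python
def cost_usd(tokens, price_per_million_usd):
    """Linear token pricing: cost = tokens * (rate / 1M)."""
    return tokens / 1_000_000 * price_per_million_usd

input_tokens = 50_000_000  # e.g. a month of pipeline traffic

deepseek = cost_usd(input_tokens, 0.27)     # rate from this page
proprietary = cost_usd(input_tokens, 5.00)  # hypothetical comparison rate

print(f"DeepSeek: ${deepseek:.2f}, proprietary: ${proprietary:.2f}, "
      f"ratio: {proprietary / deepseek:.1f}x")
```

At these rates the ratio lands around 18x, inside the 10–30x range cited above; cache hits and output-token pricing shift the exact multiple per workload.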

Use Cases for AI/ML Practitioners

  • Drop-in replacement for GPT-4/Claude in cost-sensitive pipelines via OpenAI-compatible API
  • Self-hosting on Ollama, vLLM, or llama.cpp (weights available on Hugging Face)
  • Reasoning-heavy workloads: math, code generation, formal verification
  • Fine-tuning base for domain-specific models (MIT license permits commercial use)
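The drop-in-replacement path works because DeepSeek exposes an OpenAI-compatible endpoint. A minimal sketch of the request an OpenAI-style client would send — the base URL and `deepseek-chat` model name follow DeepSeek's published API docs at the time of writing, but verify both before relying on them:

```python
def build_chat_request(prompt, model="deepseek-chat",
                       base_url="https://api.deepseek.com"):
    """Return the (url, payload) pair an OpenAI-style client would POST."""
    payload = {
        "model": model,
        "messages": [{"role": "user", "content": prompt}],
    }
    return f"{base_url}/chat/completions", payload

url, payload = build_chat_request("Explain MoE routing in one sentence.")
print(url)
```

With the official `openai` Python package, the same swap is typically just `OpenAI(api_key=..., base_url="https://api.deepseek.com")` — no other pipeline changes.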
