DeepSeek V3 paper reading notes
DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) for efficient inference and DeepSeekMoE for economical training (both MLA and DeepSeekMoE were introduced in earlier DeepSeek models). Beyond that, DeepSeek-V3 proposes an auxiliary-loss-free strategy for load balancing, aiming to minimize the performance degradation that comes from encouraging balanced expert loads. It also adopts a multi-token prediction (MTP) training objective, which the authors observe improves overall performance on evaluation benchmarks.
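Since MTP is only mentioned in passing here, a toy sketch of the general "predict several future tokens per position" idea follows. Note that DeepSeek-V3's actual MTP uses small sequential Transformer modules that keep the causal chain, not the independent heads shown below; all names and dimensions are made up for illustration.

```python
import numpy as np

# Toy flavor of a multi-token prediction objective: in addition to the usual
# next-token prediction, extra heads predict tokens further ahead, and their
# losses are folded into the training objective. DeepSeek-V3's real MTP uses
# sequential Transformer modules rather than the independent heads shown here.
vocab, d, depth = 100, 16, 2                  # depth = how many future tokens to predict
rng = np.random.default_rng(0)
heads = [rng.normal(0, 0.02, (d, vocab)) for _ in range(depth)]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mtp_loss(hidden, tokens):
    """hidden: (T, d) hidden states; tokens: (T,) ids. Average cross-entropy of
    predicting token t+k from hidden state t, for k = 1..depth."""
    losses = []
    for k, head in enumerate(heads, start=1):
        probs = softmax(hidden[:-k] @ head)              # predictions shifted by k
        target = tokens[k:]
        losses.append(-np.log(probs[np.arange(len(target)), target] + 1e-9).mean())
    return float(np.mean(losses))

print(mtp_loss(rng.normal(size=(8, d)), rng.integers(0, vocab, size=8)))
```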
FP8 mixed-precision training
Pipeline bubble
On the model-training side:
- Propose the DualPipe algorithm to reduce communication time
- Make full use of communication bandwidth
- Optimize memory footprint so that costly tensor parallelism can be avoided
Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation communication overlap.
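This is not the paper's DualPipe, just a toy illustration of the overlap idea: while expert computation runs on micro-batch i, the all-to-all "dispatch" for micro-batch i+1 runs in the background, so its latency is hidden. Real systems use asynchronous collectives on dedicated hardware resources rather than Python threads; all timings below are fake.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Toy illustration of computation/communication overlap (not DualPipe itself):
# while experts compute on micro-batch i, the all-to-all "dispatch" for
# micro-batch i+1 runs in the background, so its cost is hidden.

def dispatch(mb):            # stand-in for the all-to-all communication
    time.sleep(0.05)
    return f"data[{mb}]"

def expert_compute(data):    # stand-in for the expert FFN computation
    time.sleep(0.05)
    return f"out({data})"

micro_batches = range(4)
with ThreadPoolExecutor(max_workers=1) as comm:
    start = time.time()
    next_data = comm.submit(dispatch, 0)
    for mb in micro_batches:
        data = next_data.result()                     # wait for mb's dispatch
        if mb + 1 in micro_batches:
            next_data = comm.submit(dispatch, mb + 1) # prefetch next dispatch
        expert_compute(data)                          # overlaps with that dispatch
    print(f"total: {time.time() - start:.2f}s (vs ~0.40s fully serial)")
```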
Conventional Transformer models usually adopt Multi-Head Attention (MHA), but during generation, the heavy Key-Value (KV) cache becomes a bottleneck that limits inference efficiency.
Equipped with low-rank key-value joint compression, MLA achieves better performance than MHA while requiring a significantly smaller KV cache.
MHA
Goals:
- Reduce sequential computation (a limitation of RNNs)
- Learn dependencies between distant positions (hard for CNNs)
Let $d$ be the embedding dimension, $d_h$ the dimension of each attention head, $n_h$ the number of attention heads, and ${\bf h}_t \in \mathbb{R}^d$ the attention input of the $t$-th token at a given attention layer. Then ${\bf q}_t,{\bf k}_t,{\bf v}_t \in \mathbb{R}^{d_h n_h}$ and $W^Q,W^K,W^V \in \mathbb{R}^{d_h n_h \times d}$.
\[{\bf q}_t = W^Q {\bf h}_t\] \[{\bf k}_t = W^K {\bf h}_t\] \[{\bf v}_t = W^V {\bf h}_t\]
Each projection matrix (e.g. $W^Q$) can be viewed as $n_h$ stacked per-head blocks of $d_h$ rows each, colored alternately below:
\[\begin{bmatrix} \color{red}{W_{1 \times 1, 1}} & \color{red}{W_{1 \times 1, 2}} & \cdots & \color{red}{W_{1 \times 1, d}} \\ \color{red}{W_{2 \times 1, 1}} & \color{red}{W_{2 \times 1, 2}} & \cdots & \color{red}{W_{2 \times 1, d}} \\ \vdots & \vdots & \ddots & \vdots \\ \color{red}{W_{d_h \times 1, 1}} & \color{red}{W_{d_h \times 1, 2}} & \cdots & \color{red}{W_{d_h \times 1, d}} \\ \color{blue}{W_{1 \times 2, 1}} & \color{blue}{W_{1 \times 2, 2}} & \cdots & \color{blue}{W_{1 \times 2, d}} \\ \color{blue}{W_{2 \times 2, 1}} & \color{blue}{W_{2 \times 2, 2}} & \cdots & \color{blue}{W_{2 \times 2, d}} \\ \vdots & \vdots & \ddots & \vdots \\ \color{blue}{W_{d_h \times 2, 1}} & \color{blue}{W_{d_h \times 2, 2}} & \cdots & \color{blue}{W_{d_h \times 2, d}} \\ \vdots & \vdots & \ddots & \vdots \\ \color{red}{W_{1 \times n_h, 1}} & \color{red}{W_{1 \times n_h, 2}} & \cdots & \color{red}{W_{1 \times n_h, d}} \\ \color{red}{W_{2 \times n_h, 1}} & \color{red}{W_{2 \times n_h, 2}} & \cdots & \color{red}{W_{2 \times n_h, d}} \\ \vdots & \vdots & \ddots & \vdots \\ \color{red}{W_{d_h \times n_h, 1}} & \color{red}{W_{d_h \times n_h, 2}} & \cdots & \color{red}{W_{d_h \times n_h, d}} \\ \end{bmatrix}\]
\[{\bf q}_t=[{\bf q}_{t,1};{\bf q}_{t,2};\dots;{\bf q}_{t,n_h}]\]
\[{\bf k}_t=[{\bf k}_{t,1};{\bf k}_{t,2};\dots;{\bf k}_{t,n_h}]\]
\[{\bf v}_t=[{\bf v}_{t,1};{\bf v}_{t,2};\dots;{\bf v}_{t,n_h}]\]
\[{\bf o}_{t,i}=\sum^t_{j=1} {\rm Softmax}_j (\frac{{\bf q}^T_{t,i} {\bf k}_{j,i}}{\sqrt{d_h}}) {\bf v}_{j,i}\]
\[{\bf u}_t=W^O[{\bf o}_{t,1};{\bf o}_{t,2};\dots;{\bf o}_{t,n_h}]\]
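A minimal numpy sketch of the MHA computation above (my own toy code, not from the paper; `W_Q`/`W_K`/`W_V`/`W_O` play the roles of $W^Q, W^K, W^V, W^O$ and all dimensions are made-up small values):

```python
import numpy as np

# Toy numpy sketch of the MHA equations above.
d, d_h, n_h = 32, 8, 4                       # embedding dim, per-head dim, num heads
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.normal(0, 0.02, (n_h * d_h, d)) for _ in range(3))
W_O = rng.normal(0, 0.02, (d, n_h * d_h))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mha_step(H):
    """H: (t, d) rows h_1..h_t; returns u_t, the attention output for token t."""
    Q, K, V = (H @ W.T for W in (W_Q, W_K, W_V))            # each (t, n_h*d_h)
    Q, K, V = (X.reshape(-1, n_h, d_h) for X in (Q, K, V))  # split into heads
    heads = []
    for i in range(n_h):
        scores = K[:, i] @ Q[-1, i] / np.sqrt(d_h)          # q_{t,i}^T k_{j,i} / sqrt(d_h)
        heads.append(softmax(scores) @ V[:, i])             # o_{t,i}
    # During generation, K and V for all previous tokens must be cached:
    # 2 * n_h * d_h values per token per layer -- the KV-cache bottleneck.
    return W_O @ np.concatenate(heads)                      # u_t = W^O [o_{t,1}; ...; o_{t,n_h}]

print(mha_step(rng.normal(size=(5, d))).shape)              # (32,)
```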
MLA
- Uses low-rank matrix factorization to compress the KV cache, reducing memory usage during inference (see the sketch after this list)
- Does not apply RoPE to Q and K directly; RoPE is decoupled into a separate branch
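For reference, my condensed restatement of the low-rank joint KV compression as introduced with MLA in the DeepSeek-V2 paper (notation follows the MHA equations above; $d_c \ll d_h n_h$ is the KV compression dimension and $d^R_h$ the decoupled RoPE head dimension):

\[{\bf c}^{KV}_t = W^{DKV}{\bf h}_t, \qquad {\bf k}^C_t = W^{UK}{\bf c}^{KV}_t, \qquad {\bf v}^C_t = W^{UV}{\bf c}^{KV}_t\]
\[{\bf k}^R_t = {\rm RoPE}(W^{KR}{\bf h}_t), \qquad {\bf k}_{t,i} = [{\bf k}^C_{t,i}; {\bf k}^R_t]\]

During inference only ${\bf c}^{KV}_t$ and ${\bf k}^R_t$ need to be cached, i.e. roughly $(d_c + d^R_h)$ values per token instead of $2 n_h d_h$ for MHA. Because RoPE is applied only to the small decoupled branch (and its query counterpart), the compressed part stays position-independent and remains cacheable.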
MoE
MoE is a sparse architecture: each token is routed to, and computed by, only a small subset of the experts.
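A toy sketch of a generic sparse MoE layer (not DeepSeekMoE specifically, which additionally uses shared experts and fine-grained expert segmentation; all shapes and names below are made up):

```python
import numpy as np

# Toy sketch of a sparse MoE layer: each token is routed to its top-k experts
# and only those experts run, so per-token compute stays small even when the
# total parameter count is huge.
d, n_experts, top_k = 16, 8, 2
rng = np.random.default_rng(0)
experts = [rng.normal(0, 0.02, (d, d)) for _ in range(n_experts)]  # stand-in expert FFNs
router = rng.normal(0, 0.02, (d, n_experts))

def moe_forward(x):
    """x: (n_tokens, d)."""
    scores = x @ router                                  # (n_tokens, n_experts)
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    topk = np.argsort(-probs, axis=1)[:, :top_k]         # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, topk[t]]
        gates = gates / gates.sum()                      # renormalize over selected experts
        for g, e in zip(gates, topk[t]):
            out[t] += g * (x[t] @ experts[e])
    return out

print(moe_forward(rng.normal(size=(4, d))).shape)        # (4, 16)
```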
Auxiliary-Loss-Free Load Balancing
An unbalanced expert load can lead to routing collapse and reduces computational efficiency under expert parallelism. The conventional remedy is an auxiliary balancing loss, but if that loss is weighted too heavily it degrades model performance.
In other words, enforcing load balance too aggressively hurts model quality. To get both good load balance and good performance, DeepSeek-V3 proposes Auxiliary-Loss-Free Load Balancing: instead of an extra loss term, a per-expert bias is added to the routing scores and adjusted after each training step (see the sketch below).
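A rough sketch of how I understand the mechanism: the bias $b_i$ influences only which experts are selected for top-k, not the gating weights, and after each step the bias of overloaded experts is lowered while that of underloaded experts is raised. The exact update rule below (sign of the load deviation, step size `gamma`) is my simplification, not the paper's code.

```python
import numpy as np

# Sketch of the auxiliary-loss-free balancing idea: a per-expert bias b_i is
# added to the affinity scores *only* for top-k selection (gating weights still
# use the unbiased scores); after each step, overloaded experts become less
# attractive and underloaded ones more attractive.
n_experts, top_k, gamma = 8, 2, 0.001

def route(scores, bias):
    """scores: (n_tokens, n_experts) affinities s_{i,t}; bias: (n_experts,) b_i."""
    topk_idx = np.argsort(-(scores + bias), axis=1)[:, :top_k]   # selection uses s + b
    gates = np.take_along_axis(scores, topk_idx, axis=1)         # weights use s only
    gates = gates / gates.sum(axis=1, keepdims=True)
    return topk_idx, gates

def update_bias(bias, topk_idx):
    """Nudge biases toward balanced load after each training step."""
    load = np.bincount(topk_idx.ravel(), minlength=n_experts)
    return bias - gamma * np.sign(load - load.mean())

bias = np.zeros(n_experts)
rng = np.random.default_rng(0)
for _ in range(10):                                   # toy training steps
    scores = rng.random((256, n_experts))
    topk_idx, gates = route(scores, bias)
    bias = update_bias(bias, topk_idx)
print(np.bincount(topk_idx.ravel(), minlength=n_experts))  # per-expert load after balancing
```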