DeepSeek V3 paper reading notes
DeepSeek-V3 still adopts Multi-head Latent Attention (MLA) for efficient inference and DeepSeekMoE for economical training (both MLA and DeepSeekMoE were introduced in earlier DeepSeek models). Beyond that, DeepSeek-V3 proposes an auxiliary-loss-free strategy for load balancing, aiming to minimize the performance degradation that comes from encouraging balanced expert loads. It also adopts a multi-token prediction (MTP) training objective, which the authors observe improves overall performance on evaluation benchmarks.
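Since MTP is only mentioned in passing here, a toy sketch of the general "predict several future tokens per position" idea follows. Note that DeepSeek-V3's actual MTP uses small sequential Transformer modules that keep the causal chain, not the independent heads shown below; all names and dimensions are made up for illustration.

```python
import numpy as np

# Toy flavor of a multi-token prediction objective: in addition to the usual
# next-token prediction, extra heads predict tokens further ahead, and their
# losses are folded into the training objective. DeepSeek-V3's real MTP uses
# sequential Transformer modules rather than the independent heads shown here.
vocab, d, depth = 100, 16, 2                  # depth = how many future tokens to predict
rng = np.random.default_rng(0)
heads = [rng.normal(0, 0.02, (d, vocab)) for _ in range(depth)]

def softmax(x):
    e = np.exp(x - x.max(axis=-1, keepdims=True))
    return e / e.sum(axis=-1, keepdims=True)

def mtp_loss(hidden, tokens):
    """hidden: (T, d) hidden states; tokens: (T,) ids. Average cross-entropy of
    predicting token t+k from hidden state t, for k = 1..depth."""
    losses = []
    for k, head in enumerate(heads, start=1):
        probs = softmax(hidden[:-k] @ head)              # predictions shifted by k
        target = tokens[k:]
        losses.append(-np.log(probs[np.arange(len(target)), target] + 1e-9).mean())
    return float(np.mean(losses))

print(mtp_loss(rng.normal(size=(8, d)), rng.integers(0, vocab, size=8)))
```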
FP8 mixed-precision training
Pipeline bubble
On the model-training side:
- Propose the DualPipe algorithm to reduce communication time
- Make full use of communication bandwidth
- Optimize memory footprint so that costly tensor parallelism can be avoided
Through the co-design of algorithms, frameworks, and hardware, we overcome the communication bottleneck in cross-node MoE training, achieving near-full computation communication overlap.
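This is not the paper's DualPipe, just a toy illustration of the overlap idea: while expert computation runs on micro-batch i, the all-to-all "dispatch" for micro-batch i+1 runs in the background, so its latency is hidden. Real systems use asynchronous collectives on dedicated hardware resources rather than Python threads; all timings below are fake.

```python
import time
from concurrent.futures import ThreadPoolExecutor

# Toy illustration of computation/communication overlap (not DualPipe itself):
# while experts compute on micro-batch i, the all-to-all "dispatch" for
# micro-batch i+1 runs in the background, so its cost is hidden.

def dispatch(mb):            # stand-in for the all-to-all communication
    time.sleep(0.05)
    return f"data[{mb}]"

def expert_compute(data):    # stand-in for the expert FFN computation
    time.sleep(0.05)
    return f"out({data})"

micro_batches = range(4)
with ThreadPoolExecutor(max_workers=1) as comm:
    start = time.time()
    next_data = comm.submit(dispatch, 0)
    for mb in micro_batches:
        data = next_data.result()                     # wait for mb's dispatch
        if mb + 1 in micro_batches:
            next_data = comm.submit(dispatch, mb + 1) # prefetch next dispatch
        expert_compute(data)                          # overlaps with that dispatch
    print(f"total: {time.time() - start:.2f}s (vs ~0.40s fully serial)")
```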
Conventional Transformer models usually adopt Multi-Head Attention (MHA), but during generation, the heavy Key-Value (KV) cache becomes a bottleneck that limits inference efficiency.
Equipped with low-rank key-value joint compression, MLA achieves better performance than MHA while requiring a significantly smaller KV cache.
MHA
Goals:
- Reduce sequential computation (a limitation of RNNs)
- Learn dependencies between distant positions (hard for CNNs)
Let $d$ be the embedding dimension, $d_h$ the dimension of each attention head, $n_h$ the number of attention heads, and ${\bf h}_t \in \mathbb{R}^d$ the attention input of the $t$-th token at a given attention layer. Then ${\bf q}_t,{\bf k}_t,{\bf v}_t \in \mathbb{R}^{d_h n_h}$ and $W^Q,W^K,W^V \in \mathbb{R}^{d_h n_h \times d}$.
\[{\bf q}_t = W^Q {\bf h}_t\] \[{\bf k}_t = W^K {\bf h}_t\] \[{\bf v}_t = W^V {\bf h}_t\]
Each projection matrix (e.g. $W^Q$) can be viewed as $n_h$ stacked per-head blocks of $d_h$ rows each, colored alternately below:
\[\begin{bmatrix} \color{red}{W_{1 \times 1, 1}} & \color{red}{W_{1 \times 1, 2}} & \cdots & \color{red}{W_{1 \times 1, d}} \\ \color{red}{W_{2 \times 1, 1}} & \color{red}{W_{2 \times 1, 2}} & \cdots & \color{red}{W_{2 \times 1, d}} \\ \vdots & \vdots & \ddots & \vdots \\ \color{red}{W_{d_h \times 1, 1}} & \color{red}{W_{d_h \times 1, 2}} & \cdots & \color{red}{W_{d_h \times 1, d}} \\ \color{blue}{W_{1 \times 2, 1}} & \color{blue}{W_{1 \times 2, 2}} & \cdots & \color{blue}{W_{1 \times 2, d}} \\ \color{blue}{W_{2 \times 2, 1}} & \color{blue}{W_{2 \times 2, 2}} & \cdots & \color{blue}{W_{2 \times 2, d}} \\ \vdots & \vdots & \ddots & \vdots \\ \color{blue}{W_{d_h \times 2, 1}} & \color{blue}{W_{d_h \times 2, 2}} & \cdots & \color{blue}{W_{d_h \times 2, d}} \\ \vdots & \vdots & \ddots & \vdots \\ \color{red}{W_{1 \times n_h, 1}} & \color{red}{W_{1 \times n_h, 2}} & \cdots & \color{red}{W_{1 \times n_h, d}} \\ \color{red}{W_{2 \times n_h, 1}} & \color{red}{W_{2 \times n_h, 2}} & \cdots & \color{red}{W_{2 \times n_h, d}} \\ \vdots & \vdots & \ddots & \vdots \\ \color{red}{W_{d_h \times n_h, 1}} & \color{red}{W_{d_h \times n_h, 2}} & \cdots & \color{red}{W_{d_h \times n_h, d}} \\ \end{bmatrix}\]
\[{\bf q}_t=[{\bf q}_{t,1};{\bf q}_{t,2};\dots;{\bf q}_{t,n_h}]\]
\[{\bf k}_t=[{\bf k}_{t,1};{\bf k}_{t,2};\dots;{\bf k}_{t,n_h}]\]
\[{\bf v}_t=[{\bf v}_{t,1};{\bf v}_{t,2};\dots;{\bf v}_{t,n_h}]\]
\[{\bf o}_{t,i}=\sum^t_{j=1} {\rm Softmax}_j (\frac{{\bf q}^T_{t,i} {\bf k}_{j,i}}{\sqrt{d_h}}) {\bf v}_{j,i}\]
\[{\bf u}_t=W^O[{\bf o}_{t,1};{\bf o}_{t,2};\dots;{\bf o}_{t,n_h}]\]
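A minimal numpy sketch of the MHA computation above (my own toy code, not from the paper; `W_Q`/`W_K`/`W_V`/`W_O` play the roles of $W^Q, W^K, W^V, W^O$ and all dimensions are made-up small values):

```python
import numpy as np

# Toy numpy sketch of the MHA equations above.
d, d_h, n_h = 32, 8, 4                       # embedding dim, per-head dim, num heads
rng = np.random.default_rng(0)
W_Q, W_K, W_V = (rng.normal(0, 0.02, (n_h * d_h, d)) for _ in range(3))
W_O = rng.normal(0, 0.02, (d, n_h * d_h))

def softmax(x):
    e = np.exp(x - x.max())
    return e / e.sum()

def mha_step(H):
    """H: (t, d) rows h_1..h_t; returns u_t, the attention output for token t."""
    Q, K, V = (H @ W.T for W in (W_Q, W_K, W_V))            # each (t, n_h*d_h)
    Q, K, V = (X.reshape(-1, n_h, d_h) for X in (Q, K, V))  # split into heads
    heads = []
    for i in range(n_h):
        scores = K[:, i] @ Q[-1, i] / np.sqrt(d_h)          # q_{t,i}^T k_{j,i} / sqrt(d_h)
        heads.append(softmax(scores) @ V[:, i])             # o_{t,i}
    # During generation, K and V for all previous tokens must be cached:
    # 2 * n_h * d_h values per token per layer -- the KV-cache bottleneck.
    return W_O @ np.concatenate(heads)                      # u_t = W^O [o_{t,1}; ...; o_{t,n_h}]

print(mha_step(rng.normal(size=(5, d))).shape)              # (32,)
```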
MLA
- Uses low-rank matrix factorization to compress the KV cache, reducing memory usage during inference (see the sketch after this list)
- Does not apply RoPE to Q and K directly; RoPE is decoupled into a separate branch
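For reference, my condensed restatement of the low-rank joint KV compression as introduced with MLA in the DeepSeek-V2 paper (notation follows the MHA equations above; $d_c \ll d_h n_h$ is the KV compression dimension and $d^R_h$ the decoupled RoPE head dimension):

\[{\bf c}^{KV}_t = W^{DKV}{\bf h}_t, \qquad {\bf k}^C_t = W^{UK}{\bf c}^{KV}_t, \qquad {\bf v}^C_t = W^{UV}{\bf c}^{KV}_t\]
\[{\bf k}^R_t = {\rm RoPE}(W^{KR}{\bf h}_t), \qquad {\bf k}_{t,i} = [{\bf k}^C_{t,i}; {\bf k}^R_t]\]

During inference only ${\bf c}^{KV}_t$ and ${\bf k}^R_t$ need to be cached, i.e. roughly $(d_c + d^R_h)$ values per token instead of $2 n_h d_h$ for MHA. Because RoPE is applied only to the small decoupled branch (and its query counterpart), the compressed part stays position-independent and remains cacheable.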
MoE
MoE is a sparse architecture: each token is routed to, and computed by, only a small subset of the experts.
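A toy sketch of a generic sparse MoE layer (not DeepSeekMoE specifically, which additionally uses shared experts and fine-grained expert segmentation; all shapes and names below are made up):

```python
import numpy as np

# Toy sketch of a sparse MoE layer: each token is routed to its top-k experts
# and only those experts run, so per-token compute stays small even when the
# total parameter count is huge.
d, n_experts, top_k = 16, 8, 2
rng = np.random.default_rng(0)
experts = [rng.normal(0, 0.02, (d, d)) for _ in range(n_experts)]  # stand-in expert FFNs
router = rng.normal(0, 0.02, (d, n_experts))

def moe_forward(x):
    """x: (n_tokens, d)."""
    scores = x @ router                                  # (n_tokens, n_experts)
    scores = scores - scores.max(axis=1, keepdims=True)
    probs = np.exp(scores) / np.exp(scores).sum(axis=1, keepdims=True)
    topk = np.argsort(-probs, axis=1)[:, :top_k]         # chosen experts per token
    out = np.zeros_like(x)
    for t in range(x.shape[0]):
        gates = probs[t, topk[t]]
        gates = gates / gates.sum()                      # renormalize over selected experts
        for g, e in zip(gates, topk[t]):
            out[t] += g * (x[t] @ experts[e])
    return out

print(moe_forward(rng.normal(size=(4, d))).shape)        # (4, 16)
```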
Auxiliary-Loss-Free Load Balancing
An unbalanced expert load can lead to routing collapse and reduces computational efficiency under expert parallelism. The conventional remedy is an auxiliary balancing loss, but if that loss is weighted too heavily it degrades model performance.
In other words, enforcing load balance too aggressively hurts model quality. To get both good load balance and good performance, DeepSeek-V3 proposes Auxiliary-Loss-Free Load Balancing: instead of an extra loss term, a per-expert bias is added to the routing scores and adjusted after each training step (see the sketch below).
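A rough sketch of how I understand the mechanism: the bias $b_i$ influences only which experts are selected for top-k, not the gating weights, and after each step the bias of overloaded experts is lowered while that of underloaded experts is raised. The exact update rule below (sign of the load deviation, step size `gamma`) is my simplification, not the paper's code.

```python
import numpy as np

# Sketch of the auxiliary-loss-free balancing idea: a per-expert bias b_i is
# added to the affinity scores *only* for top-k selection (gating weights still
# use the unbiased scores); after each step, overloaded experts become less
# attractive and underloaded ones more attractive.
n_experts, top_k, gamma = 8, 2, 0.001

def route(scores, bias):
    """scores: (n_tokens, n_experts) affinities s_{i,t}; bias: (n_experts,) b_i."""
    topk_idx = np.argsort(-(scores + bias), axis=1)[:, :top_k]   # selection uses s + b
    gates = np.take_along_axis(scores, topk_idx, axis=1)         # weights use s only
    gates = gates / gates.sum(axis=1, keepdims=True)
    return topk_idx, gates

def update_bias(bias, topk_idx):
    """Nudge biases toward balanced load after each training step."""
    load = np.bincount(topk_idx.ravel(), minlength=n_experts)
    return bias - gamma * np.sign(load - load.mean())

bias = np.zeros(n_experts)
rng = np.random.default_rng(0)
for _ in range(10):                                   # toy training steps
    scores = rng.random((256, n_experts))
    topk_idx, gates = route(scores, bias)
    bias = update_bias(bias, topk_idx)
print(np.bincount(topk_idx.ravel(), minlength=n_experts))  # per-expert load after balancing
```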