Compare commits


4 Commits

Author SHA1 Message Date
Muhammad-Noraeii
68fd5c81cc
Merge b4e06d883e655af15b9510ea155a6ae013720c54 into 592fd5daf8177b205af11651bbb31a1834a8b0e0 2025-02-24 11:44:08 +06:00
DeepSeekDDM
592fd5daf8
Delete CITATION.cff
2025-02-24 11:50:20 +08:00
DeepSeekDDM
c9353aba6c
Update bib info 2025-02-24 11:25:44 +08:00
Muhammad-Noraeii
b4e06d883e
Improve DeepSeek-V3 Weight File Documentation for Clarity and Readability
- Enhanced sentence structure for better clarity and smoother flow.
- Adjusted wording and phrasing to improve accuracy and professionalism.
- Optimized the organization of information for better readability, especially in the sections related to parameters and technical details.
- Refined formatting and sectioning of the documentation for easier navigation and comprehension.
2025-01-30 14:05:58 +03:30
3 changed files with 29 additions and 246 deletions

View File

@@ -1,215 +0,0 @@
cff-version: 1.2.0
message: "If you use this work, please cite it using the following metadata."
title: "DeepSeek-V3 Technical Report"
authors:
- name: "DeepSeek-AI"
- name: "Aixin Liu"
- name: "Bei Feng"
- name: "Bing Xue"
- name: "Bingxuan Wang"
- name: "Bochao Wu"
- name: "Chengda Lu"
- name: "Chenggang Zhao"
- name: "Chengqi Deng"
- name: "Chenyu Zhang"
- name: "Chong Ruan"
- name: "Damai Dai"
- name: "Daya Guo"
- name: "Dejian Yang"
- name: "Deli Chen"
- name: "Dongjie Ji"
- name: "Erhang Li"
- name: "Fangyun Lin"
- name: "Fucong Dai"
- name: "Fuli Luo"
- name: "Guangbo Hao"
- name: "Guanting Chen"
- name: "Guowei Li"
- name: "H. Zhang"
- name: "Han Bao"
- name: "Hanwei Xu"
- name: "Haocheng Wang"
- name: "Haowei Zhang"
- name: "Honghui Ding"
- name: "Huajian Xin"
- name: "Huazuo Gao"
- name: "Hui Li"
- name: "Hui Qu"
- name: "J. L. Cai"
- name: "Jian Liang"
- name: "Jianzhong Guo"
- name: "Jiaqi Ni"
- name: "Jiashi Li"
- name: "Jiawei Wang"
- name: "Jin Chen"
- name: "Jingchang Chen"
- name: "Jingyang Yuan"
- name: "Junjie Qiu"
- name: "Junlong Li"
- name: "Junxiao Song"
- name: "Kai Dong"
- name: "Kai Hu"
- name: "Kaige Gao"
- name: "Kang Guan"
- name: "Kexin Huang"
- name: "Kuai Yu"
- name: "Lean Wang"
- name: "Lecong Zhang"
- name: "Lei Xu"
- name: "Leyi Xia"
- name: "Liang Zhao"
- name: "Litong Wang"
- name: "Liyue Zhang"
- name: "Meng Li"
- name: "Miaojun Wang"
- name: "Mingchuan Zhang"
- name: "Minghua Zhang"
- name: "Minghui Tang"
- name: "Mingming Li"
- name: "Ning Tian"
- name: "Panpan Huang"
- name: "Peiyi Wang"
- name: "Peng Zhang"
- name: "Qiancheng Wang"
- name: "Qihao Zhu"
- name: "Qinyu Chen"
- name: "Qiushi Du"
- name: "R. J. Chen"
- name: "R. L. Jin"
- name: "Ruiqi Ge"
- name: "Ruisong Zhang"
- name: "Ruizhe Pan"
- name: "Runji Wang"
- name: "Runxin Xu"
- name: "Ruoyu Zhang"
- name: "Ruyi Chen"
- name: "S. S. Li"
- name: "Shanghao Lu"
- name: "Shangyan Zhou"
- name: "Shanhuang Chen"
- name: "Shaoqing Wu"
- name: "Shengfeng Ye"
- name: "Shirong Ma"
- name: "Shiyu Wang"
- name: "Shuang Zhou"
- name: "Shuiping Yu"
- name: "Shunfeng Zhou"
- name: "Shuting Pan"
- name: "T. Wang"
- name: "Tao Yun"
- name: "Tian Pei"
- name: "Tianyu Sun"
- name: "W. L. Xiao"
- name: "Wangding Zeng"
- name: "Wanjia Zhao"
- name: "Wei An"
- name: "Wen Liu"
- name: "Wenfeng Liang"
- name: "Wenjun Gao"
- name: "Wenqin Yu"
- name: "Wentao Zhang"
- name: "X. Q. Li"
- name: "Xiangyue Jin"
- name: "Xianzu Wang"
- name: "Xiao Bi"
- name: "Xiaodong Liu"
- name: "Xiaohan Wang"
- name: "Xiaojin Shen"
- name: "Xiaokang Chen"
- name: "Xiaokang Zhang"
- name: "Xiaosha Chen"
- name: "Xiaotao Nie"
- name: "Xiaowen Sun"
- name: "Xiaoxiang Wang"
- name: "Xin Cheng"
- name: "Xin Liu"
- name: "Xin Xie"
- name: "Xingchao Liu"
- name: "Xingkai Yu"
- name: "Xinnan Song"
- name: "Xinxia Shan"
- name: "Xinyi Zhou"
- name: "Xinyu Yang"
- name: "Xinyuan Li"
- name: "Xuecheng Su"
- name: "Xuheng Lin"
- name: "Y. K. Li"
- name: "Y. Q. Wang"
- name: "Y. X. Wei"
- name: "Y. X. Zhu"
- name: "Yang Zhang"
- name: "Yanhong Xu"
- name: "Yanping Huang"
- name: "Yao Li"
- name: "Yao Zhao"
- name: "Yaofeng Sun"
- name: "Yaohui Li"
- name: "Yaohui Wang"
- name: "Yi Yu"
- name: "Yi Zheng"
- name: "Yichao Zhang"
- name: "Yifan Shi"
- name: "Yiliang Xiong"
- name: "Ying He"
- name: "Ying Tang"
- name: "Yishi Piao"
- name: "Yisong Wang"
- name: "Yixuan Tan"
- name: "Yiyang Ma"
- name: "Yiyuan Liu"
- name: "Yongqiang Guo"
- name: "Yu Wu"
- name: "Yuan Ou"
- name: "Yuchen Zhu"
- name: "Yuduan Wang"
- name: "Yue Gong"
- name: "Yuheng Zou"
- name: "Yujia He"
- name: "Yukun Zha"
- name: "Yunfan Xiong"
- name: "Yunxian Ma"
- name: "Yuting Yan"
- name: "Yuxiang Luo"
- name: "Yuxiang You"
- name: "Yuxuan Liu"
- name: "Yuyang Zhou"
- name: "Z. F. Wu"
- name: "Z. Z. Ren"
- name: "Zehui Ren"
- name: "Zhangli Sha"
- name: "Zhe Fu"
- name: "Zhean Xu"
- name: "Zhen Huang"
- name: "Zhen Zhang"
- name: "Zhenda Xie"
- name: "Zhengyan Zhang"
- name: "Zhewen Hao"
- name: "Zhibin Gou"
- name: "Zhicheng Ma"
- name: "Zhigang Yan"
- name: "Zhihong Shao"
- name: "Zhipeng Xu"
- name: "Zhiyu Wu"
- name: "Zhongyu Zhang"
- name: "Zhuoshu Li"
- name: "Zihui Gu"
- name: "Zijia Zhu"
- name: "Zijun Liu"
- name: "Zilin Li"
- name: "Ziwei Xie"
- name: "Ziyang Song"
- name: "Ziyi Gao"
- name: "Zizheng Pan"
year: 2024
identifiers:
  - type: doi
    value: 10.48550/arXiv.2412.19437
  - type: arXiv
    value: 2412.19437
url: "https://arxiv.org/abs/2412.19437"
categories:
- "cs.CL"
repository-code: "https://github.com/deepseek-ai/DeepSeek-V3"
license: "MIT"
abstract: >
  We present DeepSeek-V3, a strong Mixture-of-Experts (MoE) language model with 671B total parameters with 37B activated for each token. To achieve efficient inference and cost-effective training, DeepSeek-V3 adopts Multi-head Latent Attention (MLA) and DeepSeekMoE architectures, which were thoroughly validated in DeepSeek-V2. Furthermore, DeepSeek-V3 pioneers an auxiliary-loss-free strategy for load balancing and sets a multi-token prediction training objective for stronger performance. We pre-train DeepSeek-V3 on 14.8 trillion diverse and high-quality tokens, followed by Supervised Fine-Tuning and Reinforcement Learning stages to fully harness its capabilities. Comprehensive evaluations reveal that DeepSeek-V3 outperforms other open-source models and achieves performance comparable to leading closed-source models. Despite its excellent performance, DeepSeek-V3 requires only 2.788M H800 GPU hours for its full training. In addition, its training process is remarkably stable. Throughout the entire training process, we did not experience any irrecoverable loss spikes or perform any rollbacks.

View File

@@ -343,7 +343,7 @@ This code repository is licensed under [the MIT License](LICENSE-CODE). The use
```
@misc{deepseekai2024deepseekv3technicalreport,
title={DeepSeek-V3 Technical Report},
author={DeepSeek-AI and Aixin Liu and Bei Feng and Bing Xue and Bingxuan Wang and Bochao Wu and Chengda Lu and Chenggang Zhao and Chengqi Deng and Chenyu Zhang and Chong Ruan and Damai Dai and Daya Guo and Dejian Yang and Deli Chen and Dongjie Ji and Erhang Li and Fangyun Lin and Fucong Dai and Fuli Luo and Guangbo Hao and Guanting Chen and Guowei Li and H. Zhang and Han Bao and Hanwei Xu and Haocheng Wang and Haowei Zhang and Honghui Ding and Huajian Xin and Huazuo Gao and Hui Li and Hui Qu and J. L. Cai and Jian Liang and Jianzhong Guo and Jiaqi Ni and Jiashi Li and Jiawei Wang and Jin Chen and Jingchang Chen and Jingyang Yuan and Junjie Qiu and Junlong Li and Junxiao Song and Kai Dong and Kai Hu and Kaige Gao and Kang Guan and Kexin Huang and Kuai Yu and Lean Wang and Lecong Zhang and Lei Xu and Leyi Xia and Liang Zhao and Litong Wang and Liyue Zhang and Meng Li and Miaojun Wang and Mingchuan Zhang and Minghua Zhang and Minghui Tang and Mingming Li and Ning Tian and Panpan Huang and Peiyi Wang and Peng Zhang and Qiancheng Wang and Qihao Zhu and Qinyu Chen and Qiushi Du and R. J. Chen and R. L. Jin and Ruiqi Ge and Ruisong Zhang and Ruizhe Pan and Runji Wang and Runxin Xu and Ruoyu Zhang and Ruyi Chen and S. S. Li and Shanghao Lu and Shangyan Zhou and Shanhuang Chen and Shaoqing Wu and Shengfeng Ye and Shengfeng Ye and Shirong Ma and Shiyu Wang and Shuang Zhou and Shuiping Yu and Shunfeng Zhou and Shuting Pan and T. Wang and Tao Yun and Tian Pei and Tianyu Sun and W. L. Xiao and Wangding Zeng and Wanjia Zhao and Wei An and Wen Liu and Wenfeng Liang and Wenjun Gao and Wenqin Yu and Wentao Zhang and X. Q. Li and Xiangyue Jin and Xianzu Wang and Xiao Bi and Xiaodong Liu and Xiaohan Wang and Xiaojin Shen and Xiaokang Chen and Xiaokang Zhang and Xiaosha Chen and Xiaotao Nie and Xiaowen Sun and Xiaoxiang Wang and Xin Cheng and Xin Liu and Xin Xie and Xingchao Liu and Xingkai Yu and Xinnan Song and Xinxia Shan and Xinyi Zhou and Xinyu Yang and Xinyuan Li and Xuecheng Su and Xuheng Lin and Y. K. Li and Y. Q. Wang and Y. X. Wei and Y. X. Zhu and Yang Zhang and Yanhong Xu and Yanhong Xu and Yanping Huang and Yao Li and Yao Zhao and Yaofeng Sun and Yaohui Li and Yaohui Wang and Yi Yu and Yi Zheng and Yichao Zhang and Yifan Shi and Yiliang Xiong and Ying He and Ying Tang and Yishi Piao and Yisong Wang and Yixuan Tan and Yiyang Ma and Yiyuan Liu and Yongqiang Guo and Yu Wu and Yuan Ou and Yuchen Zhu and Yuduan Wang and Yue Gong and Yuheng Zou and Yujia He and Yukun Zha and Yunfan Xiong and Yunxian Ma and Yuting Yan and Yuxiang Luo and Yuxiang You and Yuxuan Liu and Yuyang Zhou and Z. F. Wu and Z. Z. Ren and Zehui Ren and Zhangli Sha and Zhe Fu and Zhean Xu and Zhen Huang and Zhen Zhang and Zhenda Xie and Zhengyan Zhang and Zhewen Hao and Zhibin Gou and Zhicheng Ma and Zhigang Yan and Zhihong Shao and Zhipeng Xu and Zhiyu Wu and Zhongyu Zhang and Zhuoshu Li and Zihui Gu and Zijia Zhu and Zijun Liu and Zilin Li and Ziwei Xie and Ziyang Song and Ziyi Gao and Zizheng Pan},
author={DeepSeek-AI},
year={2024},
eprint={2412.19437},
archivePrefix={arXiv},

View File

@@ -2,30 +2,30 @@
## New Fields in `config.json`
-- **model_type**: Specifies the model type, which is updated to `deepseek_v3` in this release.
-- **num_nextn_predict_layers**: Indicates the number of Multi-Token Prediction (MTP) Modules. The open-sourced V3 weights include **1 MTP Module** .
-- **quantization_config**: Describes the configuration for FP8 quantization.
+- **model_type**: Specifies the model type, which is now set to `deepseek_v3` in this release.
+- **num_nextn_predict_layers**: Defines the number of Multi-Token Prediction (MTP) Modules. The open-sourced V3 weights contain **1 MTP Module**.
+- **quantization_config**: Details the configuration for FP8 quantization.
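
For orientation, a minimal sketch of how these fields might appear at the top level of `config.json` (values taken from the descriptions above; `quantization_config` is abridged here and covered in the FP8 section below):

```json
{
  "model_type": "deepseek_v3",
  "num_nextn_predict_layers": 1,
  "quantization_config": {}
}
```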
---
-## Weight Structure Overview
+## Weight File Structure Overview
-The DeepSeek-V3 weight file consists of two main components: **Main Model Weights** and **MTP Modules**.
+The DeepSeek-V3 weight file is divided into two primary components: **Main Model Weights** and **MTP Modules**.
### 1. Main Model Weights
- **Composition**:
-- Input/output embedding layers and a complete set of 61 Transformer hidden layers.
+- Includes input/output embedding layers and a full set of 61 Transformer hidden layers.
- **Parameter Count**:
- Total parameters: **671B**
-- Activation parameters: **36.7B** (including 0.9B for Embedding and 0.9B for the output Head).
+- Activation parameters: **36.7B** (which includes 0.9B for Embedding and 0.9B for the Output Head).
#### Structural Details
- **Embedding Layer**:
- `model.embed_tokens.weight`
- **Transformer Hidden Layers**:
-- `model.layers.0` to `model.layers.60`, totaling `num_hidden_layers` layers.
+- From `model.layers.0` to `model.layers.60`, which correspond to `num_hidden_layers` layers.
- **Output Layer**:
- `model.norm.weight`
- `lm_head.weight`
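
As a rough sketch of the weight-name layout this implies (the layer prefixes follow the documentation; the per-tensor names under each `model.layers.{i}` prefix are omitted):

```python
# Sketch: enumerate the Main Model weight groups described above.
num_hidden_layers = 61  # hidden layers 0..60

main_model_keys = ["model.embed_tokens.weight"]                             # Embedding layer
main_model_keys += [f"model.layers.{i}" for i in range(num_hidden_layers)]  # Transformer hidden layer prefixes
main_model_keys += ["model.norm.weight", "lm_head.weight"]                  # output layer
```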
@@ -33,37 +33,37 @@ The DeepSeek-V3 weight file consists of two main components: **Main Model Weight
### 2. Multi-Token Prediction (MTP) Modules
- **Composition**:
-- Additional MTP Modules defined by the `num_nextn_predict_layers` field. In this model, the value is set to 1.
+- These modules are determined by the `num_nextn_predict_layers` parameter. In this model, the value is set to 1.
- **Parameter Count**:
-- Parameters: **11.5B unique parameters**, excluding the shared 0.9B Embedding and 0.9B output Head).
-- Activation parameters: **2.4B** (including the shared 0.9B Embedding and 0.9B output Head).
+- Parameters: **11.5B unique parameters** (excluding the shared 0.9B Embedding and 0.9B Output Head).
+- Activation parameters: **2.4B** (including the shared 0.9B Embedding and 0.9B Output Head).
#### Structural Details
-- **embed_tokens**: **Shares parameters** with the Embedding layer of the Main Model weights.
-- **enorm & hnorm**: RMSNorm parameters required for speculative decoding.
-- **eh_proj**: Parameters for dimensionality reduction projection on the norm results.
+- **embed_tokens**: **Shares parameters** with the Main Model's Embedding layer.
+- **enorm & hnorm**: RMSNorm parameters used for speculative decoding.
+- **eh_proj**: Parameters used for dimensionality reduction of the normalized outputs.
- **Additional Transformer Hidden Layer**:
-- `model.layers.61.self_attn & mlp` (structure identical to the Main Model hidden layers).
-- **shared_head**: **Shares parameters** with the output Head of the Main Model weights.
+- `model.layers.61.self_attn & mlp` (these are structured the same as the Main Model hidden layers).
+- **shared_head**: **Shares parameters** with the Output Head of the Main Model.
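
Putting these pieces together, a hypothetical listing of the MTP Module's weight-name prefixes (the structure follows the documentation; the exact tensor names under each prefix are assumptions):

```python
mtp_layer = 61  # appended after hidden layers 0..60; see the Layer Loading Rules below

mtp_prefixes = [
    f"model.layers.{mtp_layer}.embed_tokens",  # shares parameters with the main Embedding layer
    f"model.layers.{mtp_layer}.enorm",         # RMSNorm for speculative decoding
    f"model.layers.{mtp_layer}.hnorm",         # RMSNorm for speculative decoding
    f"model.layers.{mtp_layer}.eh_proj",       # down-projection of the normalized outputs
    f"model.layers.{mtp_layer}.self_attn",     # same structure as a Main Model hidden layer
    f"model.layers.{mtp_layer}.mlp",
    f"model.layers.{mtp_layer}.shared_head",   # shares parameters with the main Output Head
]
```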
---
-### Loading Rules
+### Layer Loading Rules
-- **Main Model Weights**: Loaded via the `num_hidden_layers` parameter in `config.json`.
-- **MTP Modules**: Loaded via the `num_nextn_predict_layers` parameter, with layer IDs appended immediately after the Main Model hidden layers. For example:
-- If `num_hidden_layers = 61` and `num_nextn_predict_layers = 1`, the MTP Module's layer ID is `61`.
+- **Main Model Weights**: These are loaded according to the `num_hidden_layers` field in `config.json`.
+- **MTP Modules**: These are loaded using the `num_nextn_predict_layers` field, with MTP layer IDs appended directly after the Main Model's hidden layers. For example:
+- With `num_hidden_layers = 61` and `num_nextn_predict_layers = 1`, the MTP Module layer ID will be `61`.
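
As a sketch of this rule (a hypothetical helper, not code from the repository):

```python
def mtp_layer_ids(num_hidden_layers: int, num_nextn_predict_layers: int) -> list[int]:
    """MTP layer IDs follow immediately after the Main Model's hidden layers."""
    return list(range(num_hidden_layers,
                      num_hidden_layers + num_nextn_predict_layers))

assert mtp_layer_ids(61, 1) == [61]  # the documented example
```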
---
## FP8 Weight Documentation
-DeepSeek-V3 natively supports FP8 weight format with 128x128 block scaling.
+DeepSeek-V3 natively supports the FP8 weight format with 128x128 block scaling.
### FP8 Configuration
-The FP8 weight file introduces a `quantization_config` field to describe the quantization method. Below is an example configuration:
+The FP8 weight file introduces a `quantization_config` field, which defines the quantization method. Below is an example of the configuration:
```json
"quantization_config": {
@@ -75,20 +75,18 @@ The FP8 weight file introduces a `quantization_config` field to describe the qua
```
- **Quantization Format**:
-- Format type: `fp8` and `e4m3` (corresponding to `torch.float8_e4m3fn`).
+- Format type: `fp8` and `e4m3` (aligned with `torch.float8_e4m3fn`).
- Weight block size: `128x128`.
- **Activation Quantization Scheme**:
-- Utilizes dynamic activation quantization (`dynamic`).
+- Uses dynamic activation quantization (`dynamic`).
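
The example `quantization_config` block above is truncated by the hunk boundary in this view; a sketch consistent with the fields described here might look as follows (the key names are assumptions, not taken from this document):

```json
"quantization_config": {
  "activation_scheme": "dynamic",
  "fmt": "e4m3",
  "quant_method": "fp8",
  "weight_block_size": [128, 128]
}
```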
### Dequantization Method
The FP8 weight file includes a `weight_scale_inv` field, which stores the dequantization scale for each weight block.
-- **Storage Format**: `float32 Tensor`, stored alongside the weight data.
+- **Storage Format**: Stored as a `float32 Tensor`, alongside the weight data.
- **Dequantization Formula**:
-- If the weight block is not aligned to 128, it is zero-padded to 128 before calculating the scale. After quantization, the padded portion is removed.
-- The dequantization process is performed as: `(128x128 weight block) * weight_scale_inv`.
+- If a weight block is not aligned to 128, it is zero-padded to 128 before calculating the scale. The padded portion is discarded after quantization.
+- Dequantization is performed using the formula: `(128x128 weight block) * weight_scale_inv`.
-Through dequantization of the FP8 weights, runtime operations enable online quantization at a granularity of `per-token-per-128-channel`.
---
+This dequantization process enables runtime operations to apply online quantization on a per-token, per-128-channel basis.
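
To make the formula concrete, here is a minimal PyTorch sketch of the per-block dequantization, assuming `weight_scale_inv` holds one `float32` scalar per 128x128 block of the weight matrix (the shapes and layout are assumptions):

```python
import torch

def dequantize_fp8_weight(w_q: torch.Tensor,
                          weight_scale_inv: torch.Tensor,
                          block: int = 128) -> torch.Tensor:
    """Apply `(128x128 weight block) * weight_scale_inv` to every block."""
    w = w_q.to(torch.float32)  # upcast from torch.float8_e4m3fn
    rows, cols = w.shape
    for i in range(0, rows, block):
        for j in range(0, cols, block):
            # Edge blocks smaller than 128 reuse the scale that was computed
            # on the zero-padded block, as described above.
            w[i:i + block, j:j + block] *= weight_scale_inv[i // block, j // block]
    return w
```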