References

[1]SENNRICH R, HADDOW B, BIRCH A. Neural machine translation of rare words with subword units [J]. arXiv, 2016-6-10 (2024-7-25).

[2]KUDO T. Subword regularization: Improving neural network translation models with multiple subword candidates [J]. arXiv, 2018-4-29 (2024-7-25).

[3]DEVLIN J, CHANG M W, LEE K, et al. BERT: Pre-training of deep bidirectional transformers for language understanding [J]. arXiv, 2019-5-24 (2024-7-25).

[4]Google SentencePiece official website [EB/OL]. https://github.com/google/sentencepiece.

[5]SU J L, LU Y, PAN S F, et al. RoFormer: Enhanced Transformer with rotary position embedding [J]. arXiv, 2023-11-8 (2024-7-25).

[6]PRESS O, SMITH N A, LEWIS M. Train short, test long: attention with linear biases enables input length extrapolation [J]. arXiv, 2022-4-22 (2024-7-25).

[7]VASWANI A, SHAZEER N, PARMAR N, et al. Attention is all you need [J]. arXiv, 2023-8-2 (2024-7-25).

[8]TUNSTALL L, WERRA L V, WOLF T. Natural language processing with Transformers [M]. O'Reilly Media Inc, 2022.

[9]JACOBS R A, JORDAN M I, NOWLAN S J, et al. Adaptive mixtures of local experts [J]. Neural Computation, 1991, 3(1): 79-87.

[10]SHAZEER N, MIRHOSEINI A, MAZIARZ K, et al. Outrageously large neural networks: The sparsely-gated mixture-of-experts layer [J]. arXiv, 2017-1-23 (2024-7-25).

[11]FEDUS W, ZOPH B, SHAZEER N. Switch Transformers: scaling to trillion parameter models with simple and efficient sparsity [J]. arXiv, 2022-6-16 (2024-7-25).

[12]ZOPH B, BELLO I, KUMAR S, et al. ST-MoE: designing stable and transferable sparse expert models [J]. arXiv, 2022-4-29 (2024-7-25).

[13]JIANG A Q, SABLAYROLLES A, ROUX A, et al. Mixtral of experts [J]. arXiv, 2024-1-8 (2024-7-25).

[14]KAIOKENDEV. Deeper notes about extending context [EB/OL]. (2023-06-27)[2024-07-25]. https://kaiokendev.github.io/context.

[15]CHEN S Y, WONG S, CHEN L J, et al. Extending context window of large language models via positional interpolation [J]. arXiv, 2023-6-28 (2024-7-25).

[16]LI D C, SHAO R L, XIE A Z, et al. How long can open-source LLMs truly promise on context length?[EB/OL].(2023-06-29)[2024-07-25]. https://lmsys.org/blog/2023-06-29-longchat/.

[17]XIONG W H, LIU J Y, MOLYBOG I, et al. Effective long-context scaling of foundation models [J]. arXiv, 2023-11-14 (2024-7-25).

[18]DAO T, FU D Y, ERMON S, et al. FlashAttention: fast and memory-efficient exact attention with IO-awareness [J]. arXiv, 2022-6-23 (2024-7-25).

[19]LEFAUDEUX B, MASSA F, LISKOVICH D, et al. xFormers: A modular and hackable Transformer modelling library [EB/OL]. [2024-07-25]. https://github.com/facebookresearch/xformers.

[20]DAO T. FlashAttention-2: faster attention with better parallelism and work partitioning [J]. arXiv, 2023-7-17 (2024-7-25).
