Transformer Architecture
[Mermaid diagram excerpt: Transformer block. Input x (B × T × d_model) → LayerNorm → Multi-Head Self-Attention, with Q, K, V = W × x projected from the normalized input …]
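The flow in that excerpt (pre-norm: LayerNorm before attention, with Q, K, V projected from the normalized input) can be sketched in a few lines of PyTorch. This is a minimal illustration of the block shape, not the post's own code; the dimensions, head count, and the residual add are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class PreNormSelfAttention(nn.Module):
    """Minimal pre-norm block: x -> LayerNorm -> multi-head self-attention -> residual."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Projects the normalized input into Q, K, V (the "Q, K, V = W × x" edge in the diagram)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model)
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out  # residual connection (assumed, standard in Transformer blocks)

x = torch.randn(2, 16, 512)        # (B=2, T=16, d_model=512)
y = PreNormSelfAttention()(x)
print(y.shape)                     # torch.Size([2, 16, 512])
```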
Goal: This post provides a detailed visual breakdown of a 175B-parameter Mixture of Experts (MoE) Transformer architecture. [Mermaid diagram excerpt: input processing subgraph …]
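For readers new to MoE, the core idea (a router dispatches each token to a small top-k subset of expert feed-forward networks) can be sketched as below. This is a toy illustration, not the 175B configuration from the post; the expert count, k, and layer sizes are placeholder values, and the loop over experts is written for clarity rather than speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy top-k Mixture-of-Experts FFN: a router picks k experts per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (B, T, d_model)
        B, T, D = x.shape
        flat = x.reshape(-1, D)                 # one row per token
        gate = F.softmax(self.router(flat), dim=-1)
        weights, idx = gate.topk(self.k, dim=-1)          # (num_tokens, k)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e        # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape(B, T, D)

y = TopKMoE()(torch.randn(2, 16, 512))
print(y.shape)   # torch.Size([2, 16, 512])
```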
Why Distributed Systems for LLMs? LLMs require distributed systems due to their scale, in both memory and compute. Too Large to Fit: LLMs like GPT-4.5 (~5–7T params) need ~10–14 TB of memory for the weights alone (5 or 7T params × 2 bytes …
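The sizing argument in that excerpt is simple arithmetic; here is a tiny sketch of it, assuming 2 bytes per parameter (FP16/BF16 weights), the excerpt's 5–7T parameter estimates, and ignoring optimizer state and activations.

```python
def weight_memory_tb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in terabytes (1 TB = 1e12 bytes)."""
    return n_params * bytes_per_param / 1e12

# The excerpt's rough parameter-count estimates (assumptions, not published numbers)
for n in (5e12, 7e12):
    print(f"{n/1e12:.0f}T params -> {weight_memory_tb(n):.0f} TB in FP16/BF16")
# 5T params -> 10 TB
# 7T params -> 14 TB
# Far beyond a single accelerator (e.g. 80 GB), hence sharding across many devices.
```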
Hello World! 🚀 Welcome to my tech blog! I'm Li Miao, a Machine Learning Engineer at Google. This blog distills my learnings on Large Language Models (LLMs) and AI essentials, programming best …