Inside a 175B Parameter MoE Transformer: Architecture Deep Dive
Inside a 175B Parameter MoE Transformer: Architecture Deep Dive
Goal
This post provides a detailed visual breakdown of a 175B parameter Mixture of Experts (MoE) Transformer architecture.
Diagram
graph TB
%% Input Processing
subgraph Input_Processing ["π€ Input Processing"]
direction TB
Tokens[Input Tokens<br/>Sequence Length: 2048] --> Embed[Token Embedding<br/>d_model: 12,288]
Embed --> PosEnc[Positional Encoding<br/>Learned/Sinusoidal]
end
%% Single Transformer Block Detail
subgraph Single_Block ["π§ Transformer Block (1 of 64)"]
direction TB
%% Self-Attention
subgraph SelfAttn_Detail ["Multi-Head Self-Attention"]
direction LR
QKV["Q,K,V Projections<br/>96 heads Γ 128 dim"]
QKV --> ScaledDot["Scaled Dot-Product<br/>Attention"]
ScaledDot --> AttnProj["Output Projection<br/>12,288 β 12,288"]
end
%% MoE FFN
subgraph MoE_Detail ["π― Mixture of Experts FFN"]
direction TB
Gate["Gate Network<br/>Top-2 Routing"] --> Router["Router Logic<br/>Load Balancing"]
Router --> ExpertGrid["128 Expert Networks<br/>Each: 12,288 β 49,152 β 12,288"]
ExpertGrid --> Combine["Weighted Combination<br/>Only 2 experts active"]
end
%% Layer connections
PosEnc --> SelfAttn_Detail
SelfAttn_Detail --> Norm1["LayerNorm + Residual"]
Norm1 --> MoE_Detail
MoE_Detail --> Norm2["LayerNorm + Residual"]
end
%% Stack indication
subgraph Layer_Stack ["π 64 Identical Layers"]
direction TB
L1["Layer 1"] --> L2["Layer 2"] --> Dots["..."] --> L64["Layer 64"]
end
%% Output
subgraph Output_Processing ["π€ Output Processing"]
direction TB
FinalNorm["Final LayerNorm"] --> LMHead["Language Model Head<br/>12,288 β 50,257 vocab"]
LMHead --> Softmax["Softmax β Token Probabilities"]
end
%% Distributed System Architecture
subgraph Distributed_Arch ["π Distributed Training (32Γ A100 GPUs)"]
direction TB
subgraph Parallelism_Strategy ["Parallelism Strategy"]
direction LR
TP["π Tensor Parallel<br/>Split attention/FFN<br/>within layers"]
PP["β‘ Pipeline Parallel<br/>16 layers per GPU<br/>4 pipeline stages"]
EP["π― Expert Parallel<br/>4 experts per GPU<br/>32 GPUs total"]
end
subgraph Memory_Compute ["πΎ Memory & Compute"]
direction TB
Mem["Memory per GPU:<br/>~40GB model + 40GB activations"]
Compute["Compute: ~312 TFLOPS<br/>Effective: ~25% MFU"]
Comm["Communication:<br/>NVLink + InfiniBand"]
end
end
%% Main flow connections
Input_Processing --> Single_Block
Single_Block --> Layer_Stack
Layer_Stack --> Output_Processing
%% Model statistics
subgraph Model_Stats ["π Model Statistics"]
direction TB
TotalParams["Total Parameters:<br/>~175B (including experts)"]
ActiveParams["Active Parameters:<br/>~22B per forward pass"]
TrainingCost["Training Cost:<br/>~$4.6M @ $2/GPU-hour"]
end
%% Styling
classDef inputStyle fill:#e1f5fe,stroke:#01579b,stroke-width:2px
classDef blockStyle fill:#f3e5f5,stroke:#4a148c,stroke-width:2px
classDef moeStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px
classDef distStyle fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
classDef statsStyle fill:#fff8e1,stroke:#ff6f00,stroke-width:2px
class Input_Processing inputStyle
class Single_Block,Layer_Stack,Output_Processing blockStyle
class MoE_Detail moeStyle
class Distributed_Arch distStyle
class Model_Stats statsStyle
πΊοΈ Model-level flow
graph LR
classDef blk fill:#fafafa,stroke:#999,stroke-width:1px
subgraph IN["Input β 2048 tokens"]
T[Tokens]:::blk --> E[Embed<br/>d = 12,288]:::blk --> P[PosEnc]:::blk
end
subgraph STACK["64 Γ Transformer Layers"]
IN --> B[see next diagram βΊ]:::blk
end
STACK --> OUT[Final LN β LM Head β Softmax]:::blk
π Single Transformer Layer
graph LR
classDef attn fill:#f0f8ff,stroke:#3182bd,stroke-width:1px
classDef moe fill:#fff7e6,stroke:#d94801,stroke-width:1px
%% Attention
subgraph ATTN["Multi-Head Self-Attn"]
QKV[Q K V proj<br/>96 Γ 128]:::attn --> SD[Scaled Dot]:::attn --> AO[Out proj]:::attn
end
%% MoE FFN
subgraph MOEFFN["Mixture-of-Experts FFN"]
Gate[Gate net<br/>Top-2]:::moe --> Rt[Router]:::moe -->
Exp[128 experts<br/>12,288β49,152β12,288]:::moe --> Comb[Combine]:::moe
end
P[Prev h] --> ATTN --> LN1[+ Residual / LN]:::attn --> MOEFFN --> LN2[+ Residual / LN]:::attn
π Parallelism & Cluster Layout
graph TB
classDef c fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px
TP[Tensor Parallel<br/>split QKV & FFN]:::c
PP[Pipeline Parallel<br/>4 stages]:::c
EP[Expert Parallel<br/>4 experts / GPU]:::c
TP --- PP --- EP
subgraph CLUSTER["32 Γ A100 β 80 GB each"]
TP & PP & EP --> NET[NVLink + InfiniBand]:::c
end
This post is licensed under CC BY 4.0 by the author.