Inside a 175B Parameter MoE Transformer: Architecture Deep Dive

Goal

This post provides a detailed visual breakdown of a 175B parameter Mixture-of-Experts (MoE) Transformer: the end-to-end data flow, a single transformer layer with its MoE feed-forward block, and the parallelism strategy used to train it across 32 A100 GPUs.

Full architecture diagram

graph TB
    %% Input Processing
    subgraph Input_Processing ["🔀 Input Processing"]
        direction TB
        Tokens[Input Tokens<br/>Sequence Length: 2048] --> Embed[Token Embedding<br/>d_model: 12,288]
        Embed --> PosEnc[Positional Encoding<br/>Learned/Sinusoidal]
    end

    %% Single Transformer Block Detail
    subgraph Single_Block ["🧠 Transformer Block (1 of 64)"]
        direction TB
        
        %% Self-Attention
        subgraph SelfAttn_Detail ["Multi-Head Self-Attention"]
            direction LR
            QKV["Q,K,V Projections<br/>96 heads Γ— 128 dim"] 
            QKV --> ScaledDot["Scaled Dot-Product<br/>Attention"]
            ScaledDot --> AttnProj["Output Projection<br/>12,288 → 12,288"]
        end
        
        %% MoE FFN
        subgraph MoE_Detail ["🎯 Mixture of Experts FFN"]
            direction TB
            Gate["Gate Network<br/>Top-2 Routing"] --> Router["Router Logic<br/>Load Balancing"]
            Router --> ExpertGrid["128 Expert Networks<br/>Each: 12,288 → 49,152 → 12,288"]
            ExpertGrid --> Combine["Weighted Combination<br/>Only 2 experts active"]
        end
        
        %% Layer connections
        PosEnc --> SelfAttn_Detail
        SelfAttn_Detail --> Norm1["LayerNorm + Residual"]
        Norm1 --> MoE_Detail  
        MoE_Detail --> Norm2["LayerNorm + Residual"]
    end

    %% Stack indication
    subgraph Layer_Stack ["📚 64 Identical Layers"]
        direction TB
        L1["Layer 1"] --> L2["Layer 2"] --> Dots["..."] --> L64["Layer 64"]
    end

    %% Output
    subgraph Output_Processing ["📤 Output Processing"]
        direction TB
        FinalNorm["Final LayerNorm"] --> LMHead["Language Model Head<br/>12,288 β†’ 50,257 vocab"]
        LMHead --> Softmax["Softmax β†’ Token Probabilities"]
    end

    %% Distributed System Architecture
    subgraph Distributed_Arch ["🌐 Distributed Training (32× A100 GPUs)"]
        direction TB
        
        subgraph Parallelism_Strategy ["Parallelism Strategy"]
            direction LR
            TP["πŸ”€ Tensor Parallel<br/>Split attention/FFN<br/>within layers"]
            PP["⚑ Pipeline Parallel<br/>16 layers per GPU<br/>4 pipeline stages"]  
            EP["🎯 Expert Parallel<br/>4 experts per GPU<br/>32 GPUs total"]
        end
        
        subgraph Memory_Compute ["💾 Memory & Compute"]
            direction TB
            Mem["Memory per GPU:<br/>~40GB model + 40GB activations"]
            Compute["Compute: ~312 TFLOPS<br/>Effective: ~25% MFU"]
            Comm["Communication:<br/>NVLink + InfiniBand"]
        end
    end

    %% Main flow connections
    Input_Processing --> Single_Block
    Single_Block --> Layer_Stack
    Layer_Stack --> Output_Processing
    
    %% Model statistics
    subgraph Model_Stats ["📊 Model Statistics"]
        direction TB
        TotalParams["Total Parameters:<br/>~175B (including experts)"]
        ActiveParams["Active Parameters:<br/>~22B per forward pass"]
        TrainingCost["Training Cost:<br/>~$4.6M @ $2/GPU-hour"]
    end

    %% Styling
    classDef inputStyle fill:#e1f5fe,stroke:#01579b,stroke-width:2px
    classDef blockStyle fill:#f3e5f5,stroke:#4a148c,stroke-width:2px  
    classDef moeStyle fill:#fff3e0,stroke:#e65100,stroke-width:2px
    classDef distStyle fill:#e8f5e8,stroke:#1b5e20,stroke-width:2px
    classDef statsStyle fill:#fff8e1,stroke:#ff6f00,stroke-width:2px

    class Input_Processing inputStyle
    class Single_Block,Layer_Stack,Output_Processing blockStyle
    class MoE_Detail moeStyle
    class Distributed_Arch distStyle
    class Model_Stats statsStyle
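
The MoE FFN is the part that differs most from a dense Transformer, so here is a minimal NumPy sketch of the top-2 routing shown above: a gate network scores all 128 experts, only the two highest-scoring experts run, and their outputs are combined with the softmaxed gate weights. The toy dimensions, the ReLU expert FFN, and the absence of capacity limits or a load-balancing loss are simplifications of mine, not a spec of the model in the diagram.

import numpy as np

# Toy sizes so the sketch runs anywhere; the diagram's values are in the comments.
d_model, d_ff, n_experts, top_k = 64, 256, 128, 2      # post: 12,288 / 49,152 / 128 experts / top-2

rng = np.random.default_rng(0)
W_gate = rng.standard_normal((d_model, n_experts)) * 0.02       # gate network: one logit per expert
experts = [(rng.standard_normal((d_model, d_ff)) * 0.02,        # each expert: d_model -> d_ff -> d_model
            rng.standard_normal((d_ff, d_model)) * 0.02)
           for _ in range(n_experts)]

def moe_ffn(x):
    """Route a single token vector through its top-2 experts and combine the results."""
    logits = x @ W_gate                              # [n_experts] gate scores
    top = np.argsort(logits)[-top_k:]                # indices of the two highest-scoring experts
    w = np.exp(logits[top] - logits[top].max())
    w /= w.sum()                                     # softmax over the selected experts only
    out = np.zeros_like(x)
    for weight, idx in zip(w, top):
        w_in, w_out = experts[idx]
        out += weight * (np.maximum(x @ w_in, 0.0) @ w_out)   # ReLU expert FFN, weighted combination
    return out

token = rng.standard_normal(d_model)
print(moe_ffn(token).shape)    # (64,): only 2 of the 128 experts did any work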

🗺️ Model-level flow

graph LR
  classDef blk fill:#fafafa,stroke:#999,stroke-width:1px
  subgraph IN["Input β€” 2048 tokens"]
      T[Tokens]:::blk --> E[Embed<br/>d = 12,288]:::blk --> P[PosEnc]:::blk
  end
  subgraph STACK["64 Γ— Transformer Layers"]
      IN --> B[see next diagram ►]:::blk
  end
  STACK --> OUT[Final LN β†’ LM Head β†’ Softmax]:::blk
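
As a complement to the diagram, here is a shape-level NumPy sketch of that flow: token embedding plus positional encoding, a stack of identical layers, then final LayerNorm, LM head, and softmax. The dimensions are scaled down so it runs anywhere (the post's values are noted in comments), and the transformer block is left as a stub that the next section fills in.

import numpy as np

# Toy sizes; the diagram's values are noted per line.
seq_len, d_model, vocab, n_layers = 32, 64, 1000, 4    # post: 2048, 12,288, 50,257, 64

rng = np.random.default_rng(0)
tok_emb = rng.standard_normal((vocab, d_model)) * 0.02      # token embedding table
pos_emb = rng.standard_normal((seq_len, d_model)) * 0.02    # learned positional encoding
W_lm = rng.standard_normal((d_model, vocab)) * 0.02         # language model head

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def transformer_block(h):
    # Stub for attention + MoE FFN; see the single-layer sketch below.
    return h

def forward(token_ids):
    h = tok_emb[token_ids] + pos_emb[: len(token_ids)]      # embed + positional encoding
    for _ in range(n_layers):                               # 64 identical layers in the post
        h = transformer_block(h)
    logits = layer_norm(h) @ W_lm                           # final LayerNorm, then LM head
    e = np.exp(logits - logits.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)                     # softmax -> token probabilities

probs = forward(rng.integers(0, vocab, size=seq_len))
print(probs.shape)    # (32, 1000): one distribution per position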

🔍 Single Transformer Layer

graph LR
  classDef attn fill:#f0f8ff,stroke:#3182bd,stroke-width:1px
  classDef moe  fill:#fff7e6,stroke:#d94801,stroke-width:1px

  %% Attention
  subgraph ATTN["Multi-Head Self-Attn"]
    QKV[Q K V proj<br/>96 × 128]:::attn --> SD[Scaled Dot]:::attn --> AO[Out proj]:::attn
  end

  %% MoE FFN
  subgraph MOEFFN["Mixture-of-Experts FFN"]
    Gate[Gate net<br/>Top-2]:::moe --> Rt[Router]:::moe
    Rt --> Exp[128 experts<br/>12,288 → 49,152 → 12,288]:::moe --> Comb[Combine]:::moe
  end

  P[Prev h] --> ATTN --> LN1[+ Residual / LN]:::attn --> MOEFFN --> LN2[+ Residual / LN]:::attn
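
A runnable sketch of the attention half of this layer: Q/K/V projections split into heads, scaled dot-product attention, the output projection, and the residual plus LayerNorm that follows. Head count and head size are shrunk from the 96 × 128 in the diagram, and the post-LN ordering is my reading of the diagram rather than a confirmed detail.

import numpy as np

# Toy sizes (the post's layer uses 96 heads x 128 dims per head, d_model 12,288).
seq_len, n_heads, d_head = 8, 4, 16
d_model = n_heads * d_head

rng = np.random.default_rng(0)
W_q, W_k, W_v, W_o = (rng.standard_normal((d_model, d_model)) * 0.05 for _ in range(4))

def softmax(x):
    e = np.exp(x - x.max(-1, keepdims=True))
    return e / e.sum(-1, keepdims=True)

def layer_norm(x, eps=1e-5):
    return (x - x.mean(-1, keepdims=True)) / np.sqrt(x.var(-1, keepdims=True) + eps)

def self_attention(h):
    """Multi-head attention: QKV projections, scaled dot-product per head, output projection."""
    def split(x):                                             # [seq, d_model] -> [heads, seq, d_head]
        return x.reshape(seq_len, n_heads, d_head).transpose(1, 0, 2)
    q, k, v = split(h @ W_q), split(h @ W_k), split(h @ W_v)
    scores = q @ k.transpose(0, 2, 1) / np.sqrt(d_head)       # scaled dot-product
    ctx = softmax(scores) @ v                                 # [heads, seq, d_head]
    ctx = ctx.transpose(1, 0, 2).reshape(seq_len, d_model)    # merge heads
    return ctx @ W_o                                          # output projection

h = rng.standard_normal((seq_len, d_model))                   # "Prev h" in the diagram
h = layer_norm(h + self_attention(h))                         # attention sub-layer: residual, then LN
# h = layer_norm(h + moe_ffn(h))                              # MoE FFN sub-layer (see the routing sketch above)
print(h.shape)    # (8, 64)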

🌐 Parallelism & Cluster Layout

graph TB
  classDef c fill:#e8f5e9,stroke:#2e7d32,stroke-width:1px
  TP[Tensor Parallel<br/>split QKV & FFN]:::c
  PP[Pipeline Parallel<br/>4 stages]:::c
  EP[Expert Parallel<br/>4 experts / GPU]:::c
  TP --- PP --- EP

  subgraph CLUSTER["32 Γ— A100 β€” 80 GB each"]
    TP & PP & EP --> NET[NVLink + InfiniBand]:::c
  end
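
The placement arithmetic in this layout can be written down directly. The helpers below are a toy sketch, not actual launcher code: they map each of the 128 experts to one of the 32 GPUs (4 per GPU) and each of the 64 layers to one of the 4 pipeline stages (16 per stage); the specific rank-numbering scheme is an assumption of mine.

# Cluster constants from the diagram.
N_GPUS, N_EXPERTS, N_LAYERS, PP_STAGES = 32, 128, 64, 4

def expert_home(expert_id: int) -> int:
    """Expert parallelism: 128 experts / 32 GPUs = 4 experts resident on each GPU."""
    return expert_id // (N_EXPERTS // N_GPUS)

def pipeline_stage(layer_id: int) -> int:
    """Pipeline parallelism: 64 layers / 4 stages = 16 consecutive layers per stage."""
    return layer_id // (N_LAYERS // PP_STAGES)

assert expert_home(0) == 0 and expert_home(127) == N_GPUS - 1          # experts 124-127 live on GPU 31
assert pipeline_stage(0) == 0 and pipeline_stage(63) == PP_STAGES - 1  # layers 48-63 form stage 3
# Tensor parallelism additionally splits each layer's QKV and FFN matrices across the
# GPUs inside a stage; it changes what lives on a GPU, not which stage a layer maps to.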