Transformer Architecture
[Mermaid diagram excerpt: Transformer block. Input x (B × T × d_model) → LayerNorm → Multi-Head Self-Attention, with Q, K, V = W × x projected from the normalized input …]
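The flow in that excerpt (pre-norm: LayerNorm before attention, with Q, K, V projected from the normalized input) can be sketched in a few lines of PyTorch. This is a minimal illustration of the block shape, not the post's own code; the dimensions, head count, and the residual add are assumptions chosen for the example.

```python
import torch
import torch.nn as nn

class PreNormSelfAttention(nn.Module):
    """Minimal pre-norm block: x -> LayerNorm -> multi-head self-attention -> residual."""
    def __init__(self, d_model: int = 512, n_heads: int = 8):
        super().__init__()
        self.norm = nn.LayerNorm(d_model)
        # Projects the normalized input into Q, K, V (the "Q, K, V = W × x" edge in the diagram)
        self.attn = nn.MultiheadAttention(d_model, n_heads, batch_first=True)

    def forward(self, x: torch.Tensor) -> torch.Tensor:
        # x: (B, T, d_model)
        h = self.norm(x)
        out, _ = self.attn(h, h, h, need_weights=False)
        return x + out  # residual connection (assumed, standard in Transformer blocks)

x = torch.randn(2, 16, 512)        # (B=2, T=16, d_model=512)
y = PreNormSelfAttention()(x)
print(y.shape)                     # torch.Size([2, 16, 512])
```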
Goal: This post provides a detailed visual breakdown of a 175B-parameter Mixture of Experts (MoE) Transformer architecture. [Mermaid diagram excerpt: input processing subgraph …]
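For readers new to MoE, the core idea (a router dispatches each token to a small top-k subset of expert feed-forward networks) can be sketched as below. This is a toy illustration, not the 175B configuration from the post; the expert count, k, and layer sizes are placeholder values, and the loop over experts is written for clarity rather than speed.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

class TopKMoE(nn.Module):
    """Toy top-k Mixture-of-Experts FFN: a router picks k experts per token."""
    def __init__(self, d_model=512, d_ff=2048, n_experts=8, k=2):
        super().__init__()
        self.k = k
        self.router = nn.Linear(d_model, n_experts)
        self.experts = nn.ModuleList(
            nn.Sequential(nn.Linear(d_model, d_ff), nn.GELU(), nn.Linear(d_ff, d_model))
            for _ in range(n_experts)
        )

    def forward(self, x):                       # x: (B, T, d_model)
        B, T, D = x.shape
        flat = x.reshape(-1, D)                 # one row per token
        gate = F.softmax(self.router(flat), dim=-1)
        weights, idx = gate.topk(self.k, dim=-1)          # (num_tokens, k)
        weights = weights / weights.sum(-1, keepdim=True)  # renormalize over chosen experts
        out = torch.zeros_like(flat)
        for e, expert in enumerate(self.experts):
            for slot in range(self.k):
                mask = idx[:, slot] == e        # tokens whose slot-th choice is expert e
                if mask.any():
                    out[mask] += weights[mask, slot].unsqueeze(-1) * expert(flat[mask])
        return out.reshape(B, T, D)

y = TopKMoE()(torch.randn(2, 16, 512))
print(y.shape)   # torch.Size([2, 16, 512])
```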
Why Distributed Systems for LLMs? LLMs require distributed systems due to their scale, in both memory and compute. Too Large to Fit: LLMs like GPT-4.5 (~5–7T params) need ~10–14 TB of memory for the weights alone (5 or 7T params × 2 bytes …
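The sizing argument in that excerpt is simple arithmetic; here is a tiny sketch of it, assuming 2 bytes per parameter (FP16/BF16 weights), the excerpt's 5–7T parameter estimates, and ignoring optimizer state and activations.

```python
def weight_memory_tb(n_params: float, bytes_per_param: int = 2) -> float:
    """Memory needed just to hold the weights, in terabytes (1 TB = 1e12 bytes)."""
    return n_params * bytes_per_param / 1e12

# The excerpt's rough parameter-count estimates (assumptions, not published numbers)
for n in (5e12, 7e12):
    print(f"{n/1e12:.0f}T params -> {weight_memory_tb(n):.0f} TB in FP16/BF16")
# 5T params -> 10 TB
# 7T params -> 14 TB
# Far beyond a single accelerator (e.g. 80 GB), hence sharding across many devices.
```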
Hello World! 🚀 Welcome to my tech blog! I'm Li Miao, a Machine Learning Engineer at Google. This blog distills my learnings on Large Language Models (LLMs) and AI essentials, programming best …