EXEED AI

Emilio Andere's Recent LinkedIn Posts

Emilio Andere

@emi-andere

Co-Founder and CEO at Wafer - Hardware Acceleration for AI

22 posts on LinkedIn

Posts

Emilio Andere

Tech & AI

3mo

the tech lead for TPUv3 software at Google published a 100+ page document on how to build a better AI chip than both TPUs and GPUs. you should go read the whole book, but here's a TLDR of the sections i found most interesting. the thesis is simple: the future is CPUs with systolic arrays. every AI chip is a systolic array with marketing. tensor cores, MXUs, matrix cores, NeuronCores. different names, same 1978 invention. the hardware around them shrinks as a percentage of chip area the larger the arrays get. so the real question isn't how to build a better accelerator. it's: what's the cheapest, most programmable way to drive a systolic array? a systolic array is 2D, so doubling the vector width makes the array 4x bigger, doing 4x more math per cycle. but you still only need one CPU core to drive it. each doubling cuts scalar overhead by 4x. google's TPUv2 had 128x128 systolic arrays while nvidia volta had 4x4. that's 5 doublings: 4^5 = 1024x less scalar work to drive the same matrix math. the entire TPUv2 XLA team was 5 people. he argues they shipped a competitive product against nvidia's massive engineering org because the systolic arrays did the work for them. he also argues warp scheduling is an expensive graphics relic. CUDA's thread model exists because shaders need unpredictable memory access. for AI, where access patterns are completely regular, software pipelining achieves the same utilization without the hardware cost. nvidia supports too many numeric formats. google hasn't added structured sparsity for 5+ years, leaving a 2x gap on the table. both require expensive host computers that don't need to exist. you can't hill-climb your way out of a bloated design one feature at a time. his proposed design removes it all at once. a standard CPU with a large systolic array per core. software-managed SRAM. no host computer, the chip IS the computer. 7-bit integer multiplies (potentially?) with 1:2 structured sparsity. 
copper trace torus networking on snap-together motherboards. SSD for KV cache via sparse attention. no expensive HBM. the anecdotes from inside google are great too. he candidly describes google's approach to TPUs as "neglect by unambitiousness." when he pressed a senior manager on why google wasn't going all-in the way nvidia does, the response was: "are you for real with this question?" he built a compiler technique that modeled heat production inside the systolic arrays to safely boost TPUv3's clock by 25%, saving hundreds of millions, and named it "compiler throttling." his manager's manager told him he sucked at marketing. whether or not you agree with every conclusion, this is the most technically deep open resource on AI chip design from someone who actually shipped both TPUs and GPUs. link in comments.
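the scaling argument above (array math grows with the square of the side length while one core still drives it) can be checked with a few lines of arithmetic. this is only a sketch of the post's reasoning; the function name is made up for illustration and it assumes scalar work per array instruction stays constant.

```python
import math

def scalar_overhead_reduction(small_side: int, large_side: int) -> int:
    """Factor by which scalar work per unit of matrix math shrinks
    when a square systolic array grows from small_side to large_side.

    Assumption (from the post): one scalar core drives the array regardless
    of size, so the reduction equals the growth in MACs per cycle.
    """
    doublings = int(math.log2(large_side // small_side))
    # each doubling of the side length quadruples the MACs per cycle
    return 4 ** doublings

# volta-style 4x4 vs TPUv2-style 128x128, the two sizes quoted in the post
reduction = scalar_overhead_reduction(4, 128)
macs_ratio = (128 * 128) // (4 * 4)
print(reduction, macs_ratio)
```

both numbers agree because 5 doublings of the side is the same thing as a 32x wider array, and 32 squared is 1024.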
471

Emilio Andere

Tech & AI

3mo

nvidia's NVLink maxes out at 72 GPUs, but google connects 9,216 TPUs into a single supercomputer. here's why the 128x scaling gap between GPUs and TPUs exists: the 72-GPU limit comes from port math. each Blackwell GPU has 18 NVLink ports. each NVSwitch chip has 144 ports. 144 / 18 = 8 GPUs per switch plane, 9 planes = 72. the copper cables degrade beyond 2 meters at 200 Gbps PAM4, constraining the entire NVLink domain to a single rack. cross the rack boundary and bandwidth drops 18x, from 1,800 GB/s per GPU over NVLink to roughly 100 GB/s over InfiniBand. google avoids this cliff entirely with optical circuit switching. their Palomar OCS has two arrays of 3D MEMS mirrors on silicon wafers. light enters a fiber, bounces off two tiny mirrors, exits through another fiber. no optical-to-electrical conversion, no packet parsing, no buffering. just redirecting photons. each switch: 136 ports, 108 watts. an equivalent electronic packet switch draws 3,000 watts. each ironwood (TPU v7) pod uses 48 OCS units to connect 144 racks of 64 chips into a 3D torus. 9,216 chips, 42.5 exaFLOPs, 1.77 petabytes of shared HBM. the OCS adds less than 5% to system cost and less than 3% to power. replacing this with InfiniBand would require 568 electronic switches. the torus topology matters too. nvidia uses fat-tree networks between racks, where adding switches scales cost linearly. google's 3D torus scales bisection bandwidth as N^(2/3) with chip count. they can also "twist" the torus by rewiring wrap-around connections through the OCS, improving all-to-all throughput by 1.63x for asymmetric topologies. no physical recabling. the MEMS mirrors just tilt to new positions. the most elegant OCS property is rate independence. a MEMS mirror doesn't care if it's reflecting a 40G, 100G, 400G, or 800G signal. google never replaces the switch, only the transceivers at the endpoints. the same hardware from a decade ago works at today's line rates. 
electronic switches must be redesigned every generation. amortized across multiple speed upgrades, networking cost drops by roughly 70%. if a chip or link fails, the OCS reroutes in seconds by tilting mirrors. TPU pods achieve 99.98% system availability, roughly 50x less downtime than a static interconnect. jobs don't need physically contiguous racks. the physical and logical topology are fully decoupled. nvidia is now converging on the same approach. they invested $4 billion in Lumentum and Coherent for optical switching. Lumentum's R300 OCS uses the same MEMS technology as google's Palomar. nvidia's Feynman platform (2028) will be the first generation with optical NVLink. the company that built its empire on copper and custom switch ASICs is buying the optical technology google has run in production for a decade.
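the port math and the scaling gap quoted above are easy to reproduce. a small sketch using only the figures from the post; the function and parameter names are hypothetical.

```python
def nvlink_domain_size(ports_per_gpu: int = 18,
                       ports_per_switch: int = 144,
                       switch_planes: int = 9) -> int:
    """GPUs in one NVLink domain, following the post's port arithmetic."""
    gpus_per_plane = ports_per_switch // ports_per_gpu  # 144 / 18 = 8
    return gpus_per_plane * switch_planes               # 8 * 9 = 72

def tpu_pod_size(racks: int = 144, chips_per_rack: int = 64) -> int:
    """Chips in one Ironwood pod as described in the post."""
    return racks * chips_per_rack

gpus = nvlink_domain_size()
tpus = tpu_pod_size()
scaling_gap = tpus // gpus        # how many NVLink domains fit in one pod
bandwidth_cliff = 1800 / 100      # NVLink GB/s vs InfiniBand GB/s per GPU

print(gpus, tpus, scaling_gap, bandwidth_cliff)
```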
198

Emilio Andere

Tech & AI

2mo

published the first Peak FLOPS substack post, a first-principles breakdown of Etched's Sohu chip. went through the math, the patents, and the available chip info to try to figure out where their 20x speedup could come from. link in comments, let me know what you think
151

Emilio Andere

Tech & AI

2mo

tesla announced terafab last week: $20-25B to manufacture their new AI inference chip, AI5. the chip has no GPU and no image signal processor. here's how the hardware works from analyzing the available patents: AI5 is 2000-2500 TOPS on a half-reticle die. 144 GB HBM3 at 1.9 TB/s. manufactured at both TSMC 3nm and samsung 2nm under a $16.5B contract through 2033. musk claims 40x inference over AI4 despite only 8x the raw compute. where does the other 5x come from? architecture. tesla removed the GPU, removed the ISP, removed every block that doesn't directly serve neural inference. the entire die is 8-bit MAC arrays. operations like SoftMax that took 40 steps on AI4 run natively on AI5. this only works because tesla is its own sole customer. nvidia designs general-purpose silicon for thousands of customers with unknown workloads. tesla designs for one software stack with known data paths. musk called it "co-signed": the hardware guarantees 8-bit MACs with native bayer input. the software guarantees it will never need a native 16-bit path. narrower contract, fewer wasted transistors. but here's the problem. the vision pipeline needs more than 8 bits of precision at multiple stages. raw cameras output 12-bit bayer data. feature extraction runs at 16-bit. spatial attention computes 3D rotations where a quantization error doesn't stay local, it propagates as a coordinate offset through every downstream voxel. so how do you run 16-bit math on 8-bit hardware? 5 patents published january 15 all describe the same answer: split the value at the byte boundary. upper byte through one MAC path, lower byte through the other, same weight applied to both. recombine at the end. full precision is never lost. it just travels in two lanes. this is how AI5 replaces the ISP. raw 12-bit bayer pixels enter four 2x2 kernels simultaneously, each pulling a different MSB/LSB and color combination. the MAC array itself does the de-mosaicing that a dedicated ISP chip would normally handle. 
no traditional image processing pipeline exists. the most technically interesting patent handles rotary positional encoding. RoPE encodes spatial position using rotation angles, but on 8-bit hardware θ has exactly 256 possible values. errors compound with each rotation, misplacing objects in 3D space. the fix: store log(θ) instead of θ. RoPE frequencies span 4 orders of magnitude in linear space, but in log space that collapses to a uniform sequence that fits cleanly in 8 bits. multiply in the 8-bit domain, exponentiate back to full precision in a separate 32-bit recovery engine. the full pipeline outputs a queryable 3D voxel dataset. send it coordinates and it returns occupancy, velocity, shape, or object type. no radar, no lidar, no object detection layer. first deployment will be on robotaxi, but the voxel pipeline doesn't care what's in the scene. the same circuit that reads a lane boundary reads a warehouse floor for optimus.
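the byte-split trick described above works because multiplication distributes over the split: w * (256 * hi + lo) = 256 * (w * hi) + (w * lo). here is a minimal sketch of that identity in plain integers. it is not taken from the patents; the function name and the sample values are made up, and it assumes unsigned inputs of up to 16 bits with signed 8-bit weights.

```python
def split_mac(values, weights):
    """Dot product of wide unsigned values with 8-bit weights,
    computed as two independent 8-bit-input MAC lanes."""
    hi_acc = 0  # lane 1: upper bytes
    lo_acc = 0  # lane 2: lower bytes
    for v, w in zip(values, weights):
        hi, lo = v >> 8, v & 0xFF   # split at the byte boundary
        hi_acc += hi * w            # same weight applied to both halves
        lo_acc += lo * w
    # recombine at the end: the upper lane is worth 256x the lower lane
    return hi_acc * 256 + lo_acc

# hypothetical 12-bit bayer samples and signed 8-bit weights
pixels = [4095, 2048, 17, 300]
weights = [-3, 127, 5, -128]

direct = sum(p * w for p, w in zip(pixels, weights))
print(split_mac(pixels, weights) == direct)
```

the accumulators are wider than 8 bits here, which is also true of real MAC arrays; only the multiplier inputs are 8-bit.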
179

Emilio Andere

Tech & AI

2mo

the "CUDA moat" is the extreme cost of writing fast software for AI hardware. two things have quickly changed: LLMs are collapsing the cost of writing code and non-NVIDIA hardware is getting competitive. so what's left? to make AI hardware viable at scale (assuming you're not going to be your own customer, which is a separate topic), i'd argue you need two things. 1. functional programmability: is it easy for developers to use your chip? 2. performance ROI: how hard is it to make my software performant on your chip? NVIDIA is currently far superior to everyone else on both of these. AMD is the closest, by far. they're pushing hard on software and usability, but they're still significantly behind. every other alternative chip needs an equivalent software stack built largely from scratch, and they're nowhere near where NVIDIA or even AMD is. keep in mind that this is incredibly complex software that requires one of the most expensive and scarce human talents in the world to write. this is the reason people have historically assigned so much weight to the CUDA moat. this type of complex software really is historically extremely hard to replicate. two important things have changed in the last year. 1. code is getting much cheaper to write. a senior GPU kernel engineer costs $350-500K/year. running Opus at the same task throughput costs a fraction of that. Opus and a talented senior engineer are not at the same level yet. but given the trajectory of coding LLMs, it's clear that this will be the case in at most 1-2 years. 2. competing hardware is already competitive at the silicon level. compare the raw FLOPS, memory bandwidth, and interconnect of an MI355X or TPU Ironwood or Trainium3 to a Blackwell. the entire basis of CUDA being a true "moat" is the incredibly complex software they've built over time. but how many Claude Codes and Codexes does it take to write an equivalent CUDA for your chip? 
the answer is not anywhere close to 0 yet, but the cost is clearly collapsing. so then what's actually still missing to break this CUDA moat? i'd like to propose that maybe nothing is missing. maybe the moat is already dying, it's just still very unevenly distributed. the three largest AI companies (anthropic, google, openai) are all running production inference on non-NVIDIA silicon right now. but the reason Anthropic can deploy Claude on Trainium and TPU is that they can afford to pay hundreds of millions of dollars for a performance team. this is true of very few companies. for most companies in the world the economics have not made sense. so the escape from the moat is real, but today it's only available to the few that can afford it. given the declining cost of software and the rise of competitive alternative hardware, my honest opinion is that it's hard for me to see a world where the CUDA moat remains in full force over the next 5 years. if this is the kind of problem you want to work on, we're hiring: wafer.ai/careers
109

Emilio Andere

Tech & AI

2mo

Wrote the most comprehensive technical breakdown I could of why I believe AI inference hardware will become completely heterogeneous over the next decade. It covers AI inference hardware math from first principles, prefill/decode disaggregation, the Groq LP30 microarchitecture, and 12+ companies building decode-specialized silicon to battle NVIDIA dominance. Let me know what you think!
368

Emilio Andere

Tech & AI

2mo

anthropic runs claude on over a million trainium2 chips with all-custom NKI kernels. here's how kernel optimization actually works on trainium2: the compute model is completely different from traditional NVIDIA/AMD GPUs. there are no warps, no threads, and no syncthreads! each NeuronCore-v4 has 4 engines running in parallel. Tensor Engine does matmul on a 128x128 systolic array, Vector Engine handles reductions, Scalar Engine handles activations (exp, gelu, silu), GPSIMD runs custom C/C++. you pipeline across engines instead of across threads. while Tensor Engine computes the current matmul, DMA prefetches the next tile from HBM and Vector Engine processes softmax on the previous result. all simultaneously. the NKI attention tutorial walks through 11 versions of the same kernel, each targeting a different engine bottleneck. by v6 they fuse subtract + exp + reduce into a single Scalar Engine instruction, freeing Vector Engine entirely for other work. take FlashAttention. the algorithm exists because GPU SRAM can't hold the NxN attention matrix, so online softmax tiles through 64-128 tokens at a time. on Trainium3, the attention kernel processes 2048 KV tokens per tile. for a 4K context, that's 2 outer-loop tiles vs 32+ on a GPU. fewer tiles and fewer HBM round trips, with overall less overhead. the GPU occupancy tradeoff doesn't exist here. no bank conflicts either. SBUF partitions are accessed deterministically by the systolic array, not by warps racing to the same bank. so what replaces GPU-style optimization? SBUF capacity management. 32 MiB of on-chip scratchpad per NeuronCore sounds generous until a BF16 [8192, 8192] weight matrix is 128 MiB. you still tile everything, the tiles are just much larger. and if your working set overflows SBUF, the compiler silently spills to HBM. this failure mode is a cliff, not a slope. at 96% SBUF utilization you get 29 MB of spill traffic. at 99% you get 931 MB. 32x more HBM traffic from 3% more utilization. why? 
once any tensor gets evicted, it needs to be reloaded later, which displaces something else, which then also spills. you cross the capacity line and it cascades. you go from ~10 TB/s SBUF bandwidth to 2.9 TB/s HBM with no warning beyond profiling. the partition dimension is locked at 128. every tensor's first dim must be ≤128. PSUM (the matmul accumulator) is only 2 MiB, free dim capped at 512. the compiler allocates SBUF automatically but it's heuristic-based and doesn't always get it right. AWS built an entire direct allocation API so you can do it yourself. the ecosystem is early but scaling fast. Anthropic running >1M of these chips, OpenAI committed $138B over 8 years, and other cutting-edge companies like Decart running these chips super efficiently. GPU optimization is all about managing scarce precious fast memory: 228 KB, occupancy, bank conflicts, register pressure. trainium optimization is managing abundance: 32 MiB, 4-engine pipelining, SBUF capacity cliffs.
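the capacity numbers in this post are worth sanity-checking by hand. a rough sketch of the arithmetic, using only figures quoted above; the helper names are hypothetical and nothing here calls the NKI API.

```python
MIB = 1024 * 1024

def tensor_mib(rows: int, cols: int, bytes_per_elem: int) -> float:
    """Size of a dense 2D tensor in MiB."""
    return rows * cols * bytes_per_elem / MIB

def outer_tiles(context_len: int, kv_tile: int) -> int:
    """Outer-loop iterations needed to cover the KV sequence (ceil division)."""
    return -(-context_len // kv_tile)

SBUF_MIB = 32  # per-NeuronCore scratchpad size quoted in the post

weight_mib = tensor_mib(8192, 8192, 2)   # BF16 is 2 bytes per element
min_weight_tiles = weight_mib / SBUF_MIB # lower bound, ignores everything else in SBUF

print(weight_mib, min_weight_tiles)
print(outer_tiles(4096, 2048), outer_tiles(4096, 128))
print(931 / 29)  # spill traffic ratio between 99% and 96% utilization
```

the last ratio comes out to roughly 32, which is the "cliff" the post describes.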
213

Emilio Andere

Tech & AI

2mo

you might assume sampling is free in LLM inference, but at small batch sizes sampling eats 10-38% of decode time. 3 kernel launches to pick 1 token from 150k options. FlashSampling is a novel approach inspired by FlashAttention which fuses all of it into the matmul. here's how it works: the last layer of an LLM is a matrix multiply. the hidden state (4096 elements) times a vocabulary weight matrix (4096 x 150,000). the output is 150,000 logits, one score per possible next token. the standard pipeline writes all 150,000 to HBM, launches a softmax kernel that reads them back and writes 150,000 probabilities, then launches a sampling kernel that reads those probabilities and picks one token. three kernels and two full HBM round-trips to select a SINGLE integer. at small batch sizes, this can cost 10-38% of decode time. you might think it's about bandwidth, but 600 KB of logits is nothing for a B300. the real cost is the kernel launches and synchronization barriers. three separate kernels means the GPU stalls twice waiting for one operation to finish before starting the next. so why does sampling need softmax? softmax converts logits into probabilities that sum to 1. you sample by walking a cumulative distribution. the problem is that softmax requires ALL logits to find the max (numerical stability) and compute the global sum. you can't start sampling until every logit is known. that's why the full tensor gets written to HBM. the matmul produces logits tile by tile on SRAM, but softmax needs the global picture. the Gumbel-Max trick removes this dependency entirely. add independent Gumbel(0,1) noise to each raw logit. take the argmax. the winning index is sampled with probability exp(logit_i) / sum exp(logit_j). exactly softmax w/ no normalization needed. why does this help at all? because argmax decomposes. the global maximum equals the maximum of tile-local maxima. so the matmul can stay tile by tile on SRAM. 
for each tile, add noise, keep only the single best candidate (one index, one score), discard the rest. the 150,000-element tensor never leaves fast memory. only a handful of tile-local winners get written to HBM, and a tiny kernel picks the global winner. FlashSampling implements this as a two-stage fused kernel. stage 1 runs the matmul and Gumbel perturbation together in the GEMM epilogue. stage 2 reduces tile candidates to one token. on B300: 1.84x over multinomial sampling, 2.52x over FlashInfer's top-k/top-p. end-to-end in vLLM on B200: 19% reduction in time per output token on Qwen3-1.7B. this is the FlashAttention playbook applied to sampling. find a mathematical identity that makes a global operation decomposable into local tiles. for attention, it was online softmax. for sampling, it's Gumbel-Max. in both cases, a large intermediate tensor that seemed necessary turns out to be eliminable. if you can reduce data while it's still on SRAM, don't write it to HBM just to read it back. link to paper in the comments.
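the decomposition above can be sketched on the CPU in pure python. this is not the FlashSampling kernel, just the Gumbel-Max identity plus the "max of tile-local maxima" step; the function names and the tile size are assumptions for illustration.

```python
import math
import random

def gumbel(rng: random.Random) -> float:
    """One Gumbel(0, 1) sample via inverse transform; clamp avoids log(0)."""
    u = max(rng.random(), 1e-300)
    return -math.log(-math.log(u))

def sample_global(logits, rng):
    """Reference: perturb every logit, take one global argmax."""
    return max((l + gumbel(rng), i) for i, l in enumerate(logits))[1]

def sample_tiled(logits, tile_size, rng):
    """Two-stage version: each tile keeps only its best (score, index)."""
    candidates = []
    for start in range(0, len(logits), tile_size):
        tile = logits[start:start + tile_size]
        # stage 1: add noise inside the tile, keep a single winner
        candidates.append(max((l + gumbel(rng), start + i)
                              for i, l in enumerate(tile)))
    # stage 2: tiny reduction over the tile-local winners
    return max(candidates)[1]

vocab_logits = [random.Random(7).uniform(-5, 5) for _ in range(1000)]
same = sample_tiled(vocab_logits, 64, random.Random(0)) == \
       sample_global(vocab_logits, random.Random(0))
print(same)
```

with the same seed both paths draw identical noise, so they pick the same token. the tiled path never materializes more than one candidate per tile, which is the property that lets the real kernel skip writing the full logit tensor.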
172

Emilio Andere

Tech & AI

2mo

launching the Peak FLOPS substack, a permanently free substack for AI hardware breakdowns, kernel deep dives, and making inference faster with low level optimizations. i've written 70+ gpu and AI hardware deep dives on linkedin this year. tensor core ISAs, systolic arrays, 8+ accelerator architectures, why megakernels are back, and more. but some of the best topics deserve more space, and i'd love to build up a community of low level optimization enjoyers too! linkedin will stay the same with weekly gpu deep dives, accelerator breakdowns, kernel optimization. Peak FLOPS is for the deeper dives. subscribe here: https://lnkd.in/gC6CaGdv
285

Emilio Andere

Tech & AI

3mo

Cursor rewrote their entire MoE layer from scratch in pure CUDA and PTX. they got a 3.5x MoE layer speedup and 1.5x end-to-end training speedup on Blackwell. let's break down what they did: initially, they tried just quantizing to naive FP8 but this gave them no speedup. on Blackwell, quantizing matrices before feeding them to an FP8 matmul consumes roughly 40% of the matmul time. when you include transpose-quantization for backward passes, it jumps to 76%. you get 2x faster matmul but spend nearly the same time just preparing the inputs. MXFP8 training can actually be slower than BF16 if you don't fuse the quantization. it gets worse on Blackwell specifically. on Hopper, tensor core results accumulate in registers, so you can pipeline dequantization with CUDA cores while the next matmul runs. on Blackwell, results go into a new on-chip memory called TMEM. to do any arithmetic on the accumulator, you transfer from TMEM to registers, process with CUDA cores, write back, and wait. Cursor measured dequantization taking 1.76x the matmul time on Blackwell (vs 1.03x on Hopper). they couldn't even beat Hopper's realistic FP8 throughput with any variation of this approach. the fix is to not dequantize at all. Blackwell's tcgen05.mma block_scale PTX instruction handles MXFP8 block scaling entirely in hardware, inside the tensor cores. no TMEM-to-register transfers, no CUDA core arithmetic. the scaling factors load into TMEM and get consumed during the matrix multiply itself. but you still need to quantize the inputs. existing kernels from TransformerEngine and TorchAO run at ~4.5 TB/s and produce scale factors in the wrong memory layout, requiring a separate reshape kernel. Cursor built a quantization kernel sustaining 6.2+ TB/s that writes scales directly in the hardware-expected packed layout. they also fused quantization into SwiGLU's epilogue, so activations get quantized as they flow through the activation function. no BF16 round-trip through HBM. 
for grouped GEMM (the actual MoE operation), they beat DeepSeek's DeepGEMM at 0.43ms vs 0.67ms for forward/dgrad. that benchmark excludes DeepGEMM's quantization time, since DeepGEMM doesn't ship optimized quantization kernels. the real-world gap is larger. Cursor uses MXFP8 with 32-element block scaling (FP8E4M3 elements, E8M0 scale factors). DeepSeek V3 used 128-element blocks for the A matrix. finer blocks = better accuracy but more scale factors to manage. Cursor verified 32-block MXFP8 converges nearly identically to BF16. MoE forward went from 25.96ms (Blackwell BF16) to 9.45ms. backward from 59.17ms to 17.04ms. end-to-end: 24k tokens/GPU vs 16k on Blackwell BF16. the kernel was written by Stuart Sul (ML at Cursor), and the full link is in the comments.
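to make "32-element block scaling with E8M0 scale factors" concrete, here is a rough sketch of how one power-of-two scale per block can be chosen. it follows the usual microscaling convention (shared exponent = floor(log2(block max)) minus the element format's largest exponent), which is an assumption on my part and not Cursor's kernel. it does not round to FP8 or produce the packed hardware layout; all names are hypothetical.

```python
import math

BLOCK = 32       # elements sharing one scale factor
E4M3_EMAX = 8    # E4M3 max normal is 448 = 1.75 * 2**8

def block_scale_exponents(values):
    """One power-of-two exponent per 32-element block (E8M0-style)."""
    exps = []
    for start in range(0, len(values), BLOCK):
        amax = max(abs(v) for v in values[start:start + BLOCK])
        if amax == 0.0:
            exps.append(0)  # arbitrary choice for an all-zero block
            continue
        _, e = math.frexp(amax)          # amax = m * 2**e with 0.5 <= m < 1
        shared = (e - 1) - E4M3_EMAX     # floor(log2(amax)) - emax
        exps.append(max(-127, min(127, shared)))  # E8M0 exponent range
    return exps

def apply_scales(values, exps):
    """Divide each element by its block's scale; FP8 rounding is omitted."""
    return [v / 2.0 ** exps[i // BLOCK] for i, v in enumerate(values)]

acts = [0.5] * 32 + [1.0] * 31 + [448.0]
exps = block_scale_exponents(acts)
scaled = apply_scales(acts, exps)
print(exps, max(abs(s) for s in scaled))
```

smaller blocks mean one outlier only distorts the scale of 32 neighbours instead of 128, which is the accuracy argument made above.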
370

Emilio Andere

Tech & AI

3mo

deepseek bypassed NCCL for MoE communication. instead of routing every transfer through a CPU proxy, they do all-to-all dispatch in 77 microseconds. here's how DeepEP works under the hood: NCCL is built for symmetric collectives like allreduce, where every GPU sends and receives the same amount. MoE all-to-all is the opposite. a gating network picks different experts for each token at runtime. GPU 0 might send 50 tokens to expert 3 and zero to expert 7. the communication pattern changes every forward pass. NCCL handles this through a CPU proxy thread. the GPU writes a transfer request, the CPU reads it, posts it to the NIC, and polls for completion. this proxy caps out at 1.7 million ops/sec. IBGDA (InfiniBand GPUDirect Async) hits 180 million. in NCCL, the CPU builds the NIC work request. DeepSeek's DeepEP builds it faster, on the GPU itself. a CUDA warp constructs a 48-byte work queue element, the same binary struct the NIC hardware expects. it has three fields: where to read from (source address + memory key), where to write to (destination address + remote key), and how much data to send. the GPU writes this into a submission ring in GPU memory, then pokes a memory-mapped doorbell register. the NIC picks up the request and executes the RDMA transfer. and there you have it: the sluggish CPU never touched anything. at the warp level: lane 0 atomically reserves a slot in the NIC's queue. all 32 lanes look up memory keys and fill in their part of the descriptor. lane 0 rings the doorbell. one warp, one network transfer. for intra-node peers connected via NVLink, the kernel detects this and writes directly through mapped pointers instead, skipping the NIC entirely. there are two kernel modes. for training, 20 of 132 SMs handle communication while 112 do compute. separate warps for NVLink forwarding, RDMA sending, and receiving. 153 GB/s NVLink, 43-58 GB/s RDMA. for decoding, pure RDMA: 77μs dispatch for 8 GPUs, 194μs for 256. 
FP8 quantization is fused into the dispatch kernel, no separate pass. deepseek published the exact library they use in production for V3 and R1. AMD ported it with MORI. Tencent added 30% throughput. UCCL extended it to AWS EFA and Broadcom NICs. the GPU networking stack deepseek built in-house is now the foundation everyone else builds on.
504

Emilio Andere

Tech & AI

2mo

NVIDIA's AVO (Agentic Variation Operators) outperformed almost all human GPU engineers by searching continuously for 7 days with no human intervention inside an optimization loop. AVO is a 23-author NVIDIA paper. the author list includes Zihao Ye (created FlashInfer), Tianqi Chen (created TVM), and Ronny Krashinsky (NVIDIA VP of GPU Architecture). the system ran for 7 days of continuous evolution on B200 and produced attention kernels beating cuDNN by 3.5% and FlashAttention-4 by up to 10.5%. here's how it works: most naive LLM-for-code systems (FunSearch, AlphaEvolve) use the LLM as a single-shot candidate generator inside a fixed pipeline. generate code, evaluate, select, repeat. the LLM proposes and the pipeline decides. AVO makes the agent the evolutionary operator. a Claude-powered coding agent gets full autonomy: read profiler output, consult the lineage of what previously worked, propose a targeted edit, compile and benchmark it, critique its own result, and revise. the agent runs the same optimize-profile-iterate loop a real human kernel engineer runs. the difference is it does it thousands of times over 7 days without stopping, completely autonomously. this is why a good profiler matters so much. kernel engineers don't optimize from theory. they profile, find the kernel is bottlenecked on e.g. shared memory throughput at a specific instruction, and fix that one thing. without profiler data, the LLM is just guessing which optimizations to try from theory. with a profiler, the LLM can actually diagnose what's slow and propose a fix. AVO is a great hybrid solution of evolutionary search + agentic autonomy. the variation operator is a coding agent with a profiler, a knowledge base, and the ability to learn from its own history. the model matters, and the feedback loop with the correct tools matters too. link to the full paper is in the comments.
187

Emilio Andere

Tech & AI

2mo

on a B200, HBM memory is 45% of manufacturing cost. here's how HBM actually works and why this specific type of memory dominates AI economics: a B200 costs ~$6,400 to manufacture. the two logic dies (TSMC 4NP, ~800mm² each) are ~$900. the eight HBM3e stacks are ~$2,900. that puts HBM at around the ~45% manufacturing cost mentioned above. but let's start off with a simple question: what is High Bandwidth Memory (HBM) and why is it so expensive? a DRAM cell is one transistor and one capacitor. charge on the cap = 1, no charge = 0. it's the simplest possible memory cell, which is why DRAM is so dense. but reading is slow: the RC time constant of the bitline limits speed, and each read drains the capacitor, requiring refresh every 64ms. DDR5 runs at ~50 GB/s per channel. not enough for AI. HBM solves this by going wide, not fast. instead of pushing each pin to higher speeds, you make the bus HUGE. HBM3e runs a 1024-bit interface (vs DDR5's 64-bit). to fit 1024 wires between memory and GPU, you can't use a PCB. the traces would be too long, the crosstalk too severe. so you put memory on a silicon interposer, millimeters from the GPU, connected through thousands of copper-filled holes (TSVs) drilled through each stacked die. modest per-pin speed, extreme bus width, physical proximity. 8 stacks x 1024 bits x 9.2 Gbps = 8 TB/s on a B200. so then why is this THE bottleneck for AI? during LLM decode, every weight loads from HBM once per token for one multiply-accumulate. arithmetic intensity = 2N FLOPs / 2N bytes = 1 FLOP/byte. the H100's ridge point is ~295. so the GPU operates at 0.34% of peak compute during decode. adding tensor cores does nothing. only bandwidth helps. this is not a bug but the basic algebra of autoregressive generation. making HBM is brutal. dies ground to ~30um (thinner than a human hair), drilled with TSVs, stacked 8-16 high, tested at every layer. yield is 20-30% lower than DDR5. and producing 1 GB of HBM consumes the wafer area of 3 GB of DDR5. 
every HBM ramp tightens the entire DRAM market. SK Hynix controls ~60% of HBM and ~70% of NVIDIA's orders. their Q4 2025 operating margin: 58.4%, exceeding TSMC's 54%. the memory supplier captured more profit from the AI boom than the foundry. Micron tried to design their HBM4 base die in-house instead of using TSMC. the DRAM process couldn't route a 2048-bit interface or pass thermal specs. they lost essentially all Vera Rubin orders. HBM4 doubles the bus to 2048 bits and moves the base die to TSMC's logic process. for the first time, the bottom of the memory stack is custom silicon: advanced ECC, memory controllers, potentially in-memory compute. the line between the chip and the memory is slowly disappearing.
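the bandwidth and roofline figures above reduce to a few multiplications. a small sketch with hypothetical helper names. one thing it surfaces: at 9.2 Gbps per pin the product is about 9.4 TB/s, so the 8 TB/s figure corresponds to roughly 8 Gbps per pin.

```python
def hbm_bandwidth_tb_s(stacks: int, bus_bits: int, pin_gbps: float) -> float:
    """Aggregate bandwidth in TB/s: stacks * pins * Gb/s per pin, bits -> bytes."""
    return stacks * bus_bits * pin_gbps / 8 / 1000

def decode_compute_utilization(flops_per_byte: float, ridge_point: float) -> float:
    """Fraction of peak compute reachable when memory-bound (roofline model)."""
    return min(1.0, flops_per_byte / ridge_point)

print(hbm_bandwidth_tb_s(8, 1024, 8.0))   # pin rate that yields ~8 TB/s
print(hbm_bandwidth_tb_s(8, 1024, 9.2))   # pin rate quoted in the post
print(decode_compute_utilization(1.0, 295.0) * 100)  # percent of peak during decode
print(2900 / 6400)                         # HBM share of manufacturing cost
```

the utilization number is why the post says only bandwidth helps: at 1 FLOP/byte the compute units sit idle no matter how many there are.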
153

Emilio Andere

Tech & AI

2mo

we're hiring an MTS to work on GPU performance infrastructure with the leading AI chip and cloud companies. up to $200k base, 1-2% equity. we've been deploying our performance agent with 2 of the leading AI chip companies and 1 public cloud provider over the past month. we have demand and need smart folks to help us push our agent forward. the work is extremely hard and interesting. you'll learn how to read assembly, do real ml systems work, and reason about hardware architectures at the ISA level. if you've ever wished you could work on the lowest levels of AI infrastructure, this is it. we're seed backed by a leading deep tech fund, YC, Jeff Dean, and others. we're a 5 person team in-person (San Francisco) 5 days a week. our work environment is competitive, rewarding, and extremely fun. DM me or apply at https://lnkd.in/g4FJECBz
432

Emilio Andere

Tech & AI

2mo

swizzling is in every fast GPU kernel but rarely explained from first principles. the core idea is that one XOR per address eliminates bank conflicts with zero overhead. here's how it works from the ground up: shared memory on a B200 is 228 KB per SM. it's the fastest memory a kernel can touch. but a single SRAM array can only serve one read per cycle. if all 228 KB were one array, only one thread could read at a time. so the hardware splits it into 32 separate physical arrays called banks, each 4 bytes wide. consecutive addresses are spread across banks round-robin: bytes 0-3 in bank 0, bytes 4-7 in bank 1, up to bank 31, then back to bank 0. when 32 threads each hit a different bank, all 32 read in parallel, in 1 single cycle. when every thread in a warp hits a unique bank: full bandwidth. when two threads hit the same bank at different addresses: the hardware serializes them into separate passes. worst case is all 32 threads on one bank: 32 sequential reads instead of 1 parallel read. so where do conflicts actually happen? in every matrix multiply. a GEMM kernel loads a tile from global memory into shared memory. the write is row-major because global loads must be coalesced. but tensor cores consume data along the K dimension, which reads down columns. here's the main problem. if you store an fp16 tile with 64 columns, each row is 128 bytes. 128 / (32 banks x 4 bytes) = 1. every row starts at bank 0. so when you read any column, every row hits the same bank - a 32-way conflict! the warp stalls while the hardware processes each row one at a time. the original fix was padding. tile[32][33] instead of tile[32][32]. the extra element shifts each row by one bank. row 0 starts at bank 0, row 1 at bank 1, done. it's a simple trick, but it wastes shared memory, up to 25% for the narrow tiles tensor cores need. on Blackwell, where kernels compete for 228 KB per SM, wasting a quarter is not an option. swizzling fixes this with zero storage overhead. 
instead of storing element (row, col) at its natural position, store it at (row, col XOR row). the XOR flips bits of the row index into the column address. row 0 stores columns in order: 0, 1, 2, 3, 4, 5, 6, 7. row 1 shifts: 1, 0, 3, 2, 5, 4, 7, 6. row 2: 2, 3, 0, 1, 6, 7, 4, 5. each row is staggered like bricks in a wall. now when you read any single column, every row maps to a different bank with zero conflicts. the XOR is self-inverse. col XOR row XOR row = col. the same operation that scrambles the write also unscrambles the read. tcgen05 MMA reads the swizzled layout directly. if the layout isn't swizzled, the instruction literally won't execute. swizzling is inside every fast kernel shipping today. FlashAttention uses it for QK^T and PV. CUTLASS uses it for every tensor core path, and ThunderKittens abstracts it so users never see a bank conflict. head to the comments for the best technical resources on swizzling.
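the XOR layout above can be verified with a toy model. this sketch assumes 8 banks and an 8x8 tile where each row spans exactly one pass over the banks, so the bank is just the physical column. real hardware has 32 banks and the swizzle patterns in CUTLASS operate on wider chunks; the names here are made up.

```python
NUM_BANKS = 8  # toy model; real shared memory has 32 banks

def physical_col(row: int, col: int, swizzled: bool) -> int:
    """Where logical element (row, col) is stored within its row."""
    return col ^ row if swizzled else col

def banks_for_column_read(col: int, swizzled: bool):
    """Banks touched when 8 threads read one logical column, one row each."""
    return [physical_col(r, col, swizzled) % NUM_BANKS for r in range(NUM_BANKS)]

def row_layout(row: int):
    """Logical column held at each physical position of a swizzled row."""
    return [pos ^ row for pos in range(NUM_BANKS)]  # XOR is its own inverse

plain = banks_for_column_read(3, swizzled=False)
swz = banks_for_column_read(3, swizzled=True)
print(len(set(plain)), len(set(swz)))  # distinct banks hit: conflict vs conflict-free
print(row_layout(1), row_layout(2))
```

the unswizzled read lands on a single bank while the swizzled read spreads across all of them, with no padding bytes added.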
351

Emilio Andere

Tech & AI

2mo

Will be talking about why hardware companies with software moats are not safe, and what's actually needed to break the CUDA moat. Join us! SF, April 8th: https://lnkd.in/g9XXTcx8
143

Emilio Andere

Tech & AI

3mo

NVIDIA and AMD each keep all their instructions in one ISA. Qualcomm's hexagon NPU runs three separate instruction sets inside a single core. here's why: a hexagon core contains three physically separate compute units, each with its own instruction encoding, register file, and data types. the scalar processor is a 4-wide VLIW with 1,800+ instructions and 32x32-bit registers. it handles control flow, DMA, and loop management. the vector unit (HVX) has 700+ instructions and 32 registers at 1024 bits each. it handles activations, softmax, and normalization. the tensor unit (HMX) has 125+ instructions and does matrix multiply on 32x32 tiles. three ISAs, three register files, three data type domains. the compiler packs all three into the same VLIW instruction bundle. every cycle, the core fetches one 128-bit packet containing up to four 32-bit instructions. those slots can mix scalar, vector, and tensor instructions freely. a single cycle can issue a DMA address computation, a softmax reduction, and a matrix multiply tile simultaneously. no other AI chip does this. NVIDIA puts everything in one ISA (PTX/SASS) and routes by opcode. AMD uses one ISA with different prefixes (s_ for scalar, v_ for vector, v_mfma_ for matrix). google's TPU packs 322-bit VLIW bundles with designated slots, but it's one instruction format. trainium runs four separate instruction streams on four separate engines. qualcomm uniquely packs three distinct ISAs into one core's instruction bundle. why three? because the throughput gap between tiers is roughly 365x. HMX does 12 TFLOPS of FP16 matrix math. HVX does 33 GFLOPS of vector work. matmul and softmax need different register widths, different memory access rules, and different precisions. forcing them into one ISA wastes encoding space. the memory model reinforces the split. all three units share 8 MB of tightly coupled memory (TCM), software-managed SRAM. but HMX can only access TCM. HVX can access both TCM and L2 cache. the scalar unit gets L1, L2, and TCM. 
three units, three access rules, one shared scratchpad. before 2020, these units were physically separate blocks on the die. data handoff cost milliseconds. the hexagon 780 fused all three into a single core with shared TCM. handoff dropped to nanoseconds. that fusion enabled micro tile inferencing: 10+ neural network layers chained without data leaving the chip. the coprocessor model keeps power low. threads must explicitly request HVX or HMX access. 6 scalar threads share 4 vector contexts and 1-2 tensor units. threads that don't need matrix math never power on the tensor hardware. this is how qualcomm runs at 3-5 watts on a phone while hitting 45 TOPS. three ISAs sounds like unnecessary complexity, but it's the opposite! each tier is as simple as possible for its job. the scalar unit has no vector registers. the tensor unit has no branch prediction. the vector unit has no matrix accumulators. separation is the simplification.
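a toy python model of the packet described above: one 128-bit packet, up to four 32-bit slots, any mix of the three units. everything here is an assumption for illustration (the `Instr` type, the unit labels, the free-text mnemonics); real hexagon packets have slot and resource constraints this sketch does not model. the last lines just redo the throughput-gap arithmetic with the post's figures.

```python
from dataclasses import dataclass

PACKET_BITS = 128
SLOT_BITS = 32
MAX_SLOTS = PACKET_BITS // SLOT_BITS  # four instructions per packet

UNITS = {"scalar", "hvx", "hmx"}  # hypothetical labels for the three ISAs

@dataclass(frozen=True)
class Instr:
    unit: str  # which compute unit executes the instruction
    text: str  # made-up description, not a real mnemonic

def make_packet(instrs):
    """Validate a toy VLIW packet: 1 to 4 slots, any mix of units."""
    if not 1 <= len(instrs) <= MAX_SLOTS:
        raise ValueError(f"a packet holds 1 to {MAX_SLOTS} instructions")
    for instr in instrs:
        if instr.unit not in UNITS:
            raise ValueError(f"unknown unit: {instr.unit}")
    return tuple(instrs)

# one cycle issuing work to all three units at once
packet = make_packet([
    Instr("scalar", "compute next DMA address"),
    Instr("hvx", "softmax partial reduction"),
    Instr("hmx", "32x32 tile matrix multiply"),
])
units_in_packet = {instr.unit for instr in packet}

# throughput gap between the tensor and vector tiers, from the post's numbers
hmx_flops = 12e12  # 12 TFLOPS FP16
hvx_flops = 33e9   # 33 GFLOPS
gap = hmx_flops / hvx_flops
print(sorted(units_in_packet), f"{gap:.0f}x")
```

the ratio comes out near 364x, which lines up with the roughly 365x gap quoted in the post.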
466

Emilio Andere

Tech & AI

2mo

taalas permanently burns AI model weights into silicon transistors. one chip can run one model. i dug through their patents, expert discussions on hackernews, and prior research to understand how. here's everything i've found: ljubisa bajic built tenstorrent from scratch, grew it to 300 employees, then stepped down when jim keller took over as CEO in early 2023. six months later he co-founded taalas with two former AMD colleagues. the thesis is to stop simulating intelligence on general-purpose computers, and instead cast it directly into silicon. the HC1 came out of stealth last month. it's TSMC 6nm, has 53 billion transistors, and they hardwired llama 3.1 8B onto a single chip. no HBM, no liquid cooling, no CoWoS. air-cooled at 200 watts. on a GPU, every single token generation reads the entire model from HBM. most of the time that memory fetch, not compute, is the bottleneck. taalas removes it by making the weights part of the circuit itself instead of storing them in memory. bajic says a single transistor stores 4 bits and performs the multiply. fully digital, not analog. he didn't reveal the circuit-level trick, but taalas's patent (US20250123802A1) describes the architecture. the base chip has shared multipliers that permanently compute all 16 possible products for a 4-bit weight: input times 0, input times 1, up to input times 15. each model weight is a single mask ROM transistor whose metal-layer connection routes it to the correct pre-computed product. the transistor doesn't do math - it picks which answer to pass through. the routing IS the multiply! this is why only 2 of the chip's 100 fabrication layers change per model. taalas pre-fabs base wafers in bulk, then customizes those 2 metal masks when a new model arrives. because of this, their (alleged) turnaround is 2 months from weights to deployed silicon. this approach actually has a 50-year lineage. mask-programmable gate arrays in the 1970s pre-fabricated transistors and customized only metal layers. 
in 1989, yann lecun's team at bell labs built the ANNA chip with 6-bit weights doing OCR that matched 32-bit accuracy. intel acquired eASIC in 2018 for the same structured ASIC concept. a 2025 paper from the chinese academy of sciences independently validated metal-embedded weights for LLM inference. the tradeoffs are pretty severe. one model per chip. when that model is superseded, the chip is worthless. 3-bit quantization hurts quality (HC2 will move to 4-bit). KV cache in SRAM limits context length. scaling to frontier models across many chips is unproven, and no independent benchmarks exist (yet) beyond the live demo. but even treating chips as disposable, a $400 chip that outperforms a $30,000 GPU by 73x on that workload (taalas's own numbers) is the most interesting and unique bet i've come across in the AI hardware space so far. i'm personally super excited about what this team keeps building.
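a small python sketch of the "routing IS the multiply" idea as the post describes it: shared logic computes all 16 products of an input once, and each weight only selects one of them. this is a software analogy under stated assumptions (unsigned 4-bit weights, integer inputs, made-up function names), not taalas's actual circuit.

```python
WEIGHT_BITS = 4
NUM_PRODUCTS = 1 << WEIGHT_BITS  # 16 possible values of a 4-bit weight

def shared_products(x):
    """Stand-in for the shared multipliers: x times 0, x times 1, ..., x times 15."""
    return [x * w for w in range(NUM_PRODUCTS)]

def hardwired_matvec(weight_rows, inputs):
    """Matrix-vector product where weights only index pre-computed products."""
    # one product table per input, reused by every row that reads that input
    tables = [shared_products(x) for x in inputs]
    # each weight acts as a fixed route into a table, never as an operand
    return [sum(tables[j][w] for j, w in enumerate(row)) for row in weight_rows]

weights = [[3, 0, 15],
           [1, 7, 2]]   # 4-bit weights, frozen the way a metal mask would freeze them
inputs = [2, 5, 1]

routed = hardwired_matvec(weights, inputs)
reference = [sum(w * x for w, x in zip(row, inputs)) for row in weights]
print(routed, reference)  # [21, 39] [21, 39]
```

the selection step does no arithmetic at all, which is the point: the only per-model information is which of the 16 wires each weight connects to.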
229

Emilio Andere

Tech & AI

3mo

google, microsoft, amazon, meta, nvidia, and openai all manufacture AI chips on TSMC 3nm. 6 companies, 6 chip architectures, but only 1 foundry.
1. google ironwood: 192GB HBM3e, 4.6 PFLOPS FP8
2. microsoft maia 200: 216GB HBM3e, 140B transistors, 10+ PFLOPS FP4
3. amazon trainium 3: 144GB HBM3e, 2.5 PFLOPS FP8
4. nvidia vera rubin: 288GB HBM4, 336B transistors, 50 PFLOPS FP4
5. meta MTIA (latest gen)
6. openai's custom chip via broadcom
all on TSMC N3P, and all CoWoS packaging. but these chips share almost nothing architecturally! ironwood is a 256x256 systolic array. maia 200 is a monolithic die at near-reticle size (~840mm2). rubin is a multi-chiplet module with two compute dies. openai went with a pure inference design (that we know very little of). they don't even use the same CoWoS variant. rubin uses CoWoS-L: small silicon bridge chiplets embedded inside an organic interposer, connecting its two compute dies at 10 TB/s via NV-HBI. trainium 3 uses CoWoS-R: a purely organic RDL interposer with 6 layers of copper, with no silicon interposer at all. maia 200 uses standard CoWoS with a monolithic silicon interposer. the packaging is one of the hardest constraints in this process. CoWoS requires fabricating silicon interposers with TSVs, bonding known-good dies at 40um micro-bump pitch, then thinning the whole assembly to ~100um. a single defect destroys a package worth $50,000+ in components (compute dies + HBM stacks + interposer + substrate). TSMC is scaling from 35,000 CoWoS wafers/month in late 2024 to 130,000 by end of 2026, but nvidia alone has booked 60% of that capacity. broadcom gets 15%. AMD 11%. everyone else fights over 14%. TSMC holds ~90% of advanced logic chip manufacturing. demand exceeds supply by nearly 3x at advanced nodes. 3nm lines are 100% booked through 2026. all 3 HBM suppliers (SK hynix 57%, samsung, micron) are sold out through 2026. right now, you simply cannot design around TSMC. there is no alternative foundry at 3nm. 
samsung holds 7% foundry share. 90% of TSMC's production sits in taiwan, and the arizona expansion alone is costing $165 billion, for a project that TSMC's CEO himself has said is doomed to fail. and the dependency is only deepening. with HBM4, the memory base die moves from DRAM process to TSMC's logic foundry. SK hynix is fabricating HBM4 base dies on TSMC N3P. TSMC now makes the logic chips, the silicon interposers, the packaging assembly, and critical components of the memory. they call it "Foundry 2.0." for a chip like rubin, TSMC touches nearly every step of the physical stack. every ASIC that replaces an nvidia GPU is another customer competing for the same CoWoS wafers, the same 3nm lines, the same HBM allocation. TSMC doesn't care whether GPUs or ASICs win, because it currently makes both. their last quarter's gross margin was 62.3%, a record for them. TSMC is a truly fantastic company at the center of the AI chip revolution.
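a quick python check of the CoWoS allocation arithmetic quoted in the post. one assumption to flag: the post doesn't say which month's capacity the percentages apply to, so this sketch applies them to the end-of-2026 figure of 130,000 wafers/month. the variable and function names are made up for illustration.

```python
COWOS_WAFERS_LATE_2024 = 35_000   # wafers/month, from the post
COWOS_WAFERS_END_2026 = 130_000   # wafers/month, from the post

# booked share of CoWoS capacity in percent, from the post
SHARE_PCT = {"nvidia": 60, "broadcom": 15, "amd": 11, "everyone else": 14}

def wafers_per_month(total, share_pct):
    """Split a monthly wafer count by integer percentage shares."""
    return {name: total * pct // 100 for name, pct in share_pct.items()}

# assumption: shares applied to the end-of-2026 capacity
allocation = wafers_per_month(COWOS_WAFERS_END_2026, SHARE_PCT)
growth = COWOS_WAFERS_END_2026 / COWOS_WAFERS_LATE_2024

for name, wafers in allocation.items():
    print(f"{name}: {wafers:,} wafers/month")
print(f"capacity growth: {growth:.1f}x")
```

under that assumption nvidia's share works out to 78,000 wafers/month, more than twice TSMC's entire late-2024 CoWoS output, while everyone outside the top three splits 18,200.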
106

Emilio Andere

Tech & AI

2mo

Full breakdown of the Cerebras Wafer Scale Engine from first principles. the roofline math behind why it exists, the five engineering problems nobody else had solved, and more!
83

LLMday

Tech & AI

2mo

We are kicking off the first LLMday in San Francisco. 🇺🇸 April 16, 2026 at Harness. Large Language Models, AI, and Machine Learning take the stage. Interested in attending? Register here: https://lnkd.in/ggtkHVFP
Meet the speakers and hear what they are building:
Lior Schejter
Marta Gajowa - Stealth Startup
Aaron Zamora - Cribl
Aoi M. - Aimoji Healthcare
Radhika Gundavelli - Rescale
Sandhya Subramani - Amazon Web Services
Elizabeth Fuentes Leone - Amazon Web Services
Titus Capilnean - Civic Technologies
Dreema Patel - Adobe
Sonali Priya - LB Networks
James Duffy - Lantern
Maciej Szymkowski, Ph.D. - Future Processing
Kamal Kishore - Google
Emilio Andere - Wafer
George Job - Walmart Global Tech
Sujithra Periasamy - Google
Leon Adato - Cribl
Falko Buttler - Lantern
Julie Coorevits - Barco
20 pages
32

Emilio Andere

Tech & AI

3mo

Super grateful to have you onboard Max Buckley! 🥳
40