What I've read today 📚
Claude Opus 4.6 (https://lnkd.in/e3hspN75) — 1M context window, leads Terminal-Bench 2.0, beats GPT-5.2 by 144 Elo on knowledge work. The model they use to build the model.
DeepMind's cognitive framework for measuring AGI (https://lnkd.in/ebEt8QfW) — A taxonomy of 10 cognitive abilities, plus a $200K Kaggle hackathon to build the actual evals. At least they're being honest that we don't know how to measure this.
Mistral Forge (https://lnkd.in/emkFPVtW) — Train frontier-grade models on your proprietary data. ASML, ESA, Ericsson already building. The enterprise custom model market is getting real.
Nvidia Vera Rubin: 10x inference throughput per watt vs Blackwell (https://venturebeat.com) — Jensen calls it "the greatest infrastructure buildout in history." He's said that before. Still probably true.
CausalRM: RLHF reward modeling from clicks and upvotes (https://lnkd.in/eRcenCRy) — Instead of expensive human annotation, uses observational feedback with causal correction for noise and bias. 49.2% gain on safety benchmarks. Quietly important.
DyMoE: MoE inference on edge devices, 22.7x speedup in time to first token (TTFT) (https://lnkd.in/eFybEXzi) — Dynamic expert quantization that's actually depth-aware. Real numbers on commercial hardware.
F2LLM-v2: Multilingual embeddings, 200+ languages, 80M–14B, fully open-sourced (https://lnkd.in/eHZXt9zX) — #1 on 11 MTEB benchmarks. The kind of paper that quietly raises the floor for everyone.
VLA efficiency metrics are wrong (https://lnkd.in/eYueCK8u) — FLOPs and token throughput don't predict real robot execution cost. Methods that look efficient under standard metrics often make robots slower and jerkier. The benchmark problem extends to robotics.
Optimal pretraining vs specialization split for LLMs (https://lnkd.in/e96WwRWX) — Scaling laws for deciding how much compute to spend on general pretraining vs domain fine-tuning. Useful if you're building a vertical model.