Paolo Perrone's Recent LinkedIn Posts

Paolo Perrone

@paoloperrone

No BS AI/ML Content | ML Engineer with a Plot Twist 🥷 100M+ Views

25 posts on LinkedIn

Posts

Paolo Perrone

Tech & AI

2mo

Andrej Karpathy built a complete GPT in 243 lines of Python. No PyTorch. No NumPy. No dependencies.

Someone just made it interactive. Every building block runs live in your browser.

Here's what 243 lines cover:

1️⃣ Autograd from scratch
A tiny Value class that tracks operations and computes gradients. Forward builds the graph. Backward flows gradients through it. No library. Pure chain rule. The interactive viz makes this click in seconds.

2️⃣ The full GPT architecture in one function
Token + position embeddings → RMSNorm → multi-head attention → MLP → residual → logits. n_embd=16, n_head=4, n_layer=1. The smallest transformer that captures every essential mechanic.

3️⃣ Attention you can actually see
Q, K, V projections. Scaled dot-product. Causal masking. Type any text and watch the attention heatmap update live. This is worth more than 100 diagrams.

4️⃣ Training in 30 lines
Forward → cross-entropy loss → backward() → Adam. Linear LR decay. 1,000 steps. Character-level prediction. Watch the loss curve drop in real time.

5️⃣ Temperature you can feel
Slide the temperature from 0.1 (focused) to 3.0 (random). Watch the probability distribution reshape. The model generates names that never existed in the data.

Every line is the algorithm. Everything else, in every other codebase, is just efficiency.

This is what $5,000 "intro to transformers" courses try to teach. 243 lines. $0. Interactive. No setup.

How many engineers on your team actually understand what happens inside a transformer? 👇

💾 Bookmark this. It's the fastest path from "I use transformers" to "I understand them."
90
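
To make the "tiny Value class" idea concrete, here is a minimal sketch of scalar autograd. It is not Karpathy's code: the class layout, the `_parents` and `_local_grads` names, and the choice to support only `+` and `*` are assumptions made for illustration.

```python
class Value:
    """A scalar that remembers how it was computed so gradients can flow back."""

    def __init__(self, data, parents=(), local_grads=()):
        self.data = data
        self.grad = 0.0
        self._parents = parents          # Values this one was computed from
        self._local_grads = local_grads  # d(self)/d(parent), one per parent

    def __add__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data + other.data, (self, other), (1.0, 1.0))

    def __mul__(self, other):
        other = other if isinstance(other, Value) else Value(other)
        return Value(self.data * other.data, (self, other), (other.data, self.data))

    def backward(self):
        # Topological order guarantees a node's gradient is complete
        # before it is pushed to the node's parents.
        order, seen = [], set()

        def visit(node):
            if id(node) not in seen:
                seen.add(id(node))
                for parent in node._parents:
                    visit(parent)
                order.append(node)

        visit(self)
        self.grad = 1.0
        for node in reversed(order):
            for parent, local in zip(node._parents, node._local_grads):
                parent.grad += local * node.grad  # chain rule


x = Value(3.0)
y = Value(4.0)
z = x * y + x   # forward pass builds the graph
z.backward()    # backward pass fills in x.grad and y.grad
```

Here `x.grad` ends up as `y + 1` and `y.grad` as `x`, which is what differentiating `x*y + x` by hand gives.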

Paolo Perrone

Tech & AI

2mo

Most teams deploy LLMs with default settings and wonder why inference costs $50K/month. The optimization stack exists. Most engineers don't know the layers.

Here's the full inference optimization hierarchy:

LAYER 1: Serving architecture
Before touching a single kernel, get your serving right.
vLLM (74K ⭐): PagedAttention, continuous batching. https://lnkd.in/eeT_HM2B
SGLang (25K ⭐): structured generation + RadixAttention. Faster for constrained outputs. https://lnkd.in/eKK7sxdf

LAYER 2: Quantization
Shrink the model without killing accuracy.
llama.cpp (92K ⭐): GGUF quantization. Run 70B on consumer hardware. https://lnkd.in/eJrUg_qd
Unsloth (50K ⭐): QLoRA fine-tuning at 70% less VRAM. https://lnkd.in/gJZtH4Y4
This layer alone can cut your GPU bill in half.

LAYER 3: Attention + caching
How much are you spending on redundant prefill?
Flash Attention (21K ⭐): memory-efficient, IO-aware. Non-negotiable. https://lnkd.in/eYkuRuxC
LMCache (1.5K ⭐): KV cache sharing. Eliminates it entirely. github.com/LMCache/LMCache

LAYER 4: Hardware-specific acceleration
Match your optimization to your silicon.
TensorRT-LLM: purpose-built for NVIDIA GPUs. Kernel fusion, in-flight batching. https://lnkd.in/ekuFuDAP
MLX: native framework for Apple Silicon. Inference without CUDA. github.com/ml-explore/mlx

LAYER 5: Custom kernels
Where the real differentiation lives.
LeetCUDA (9K ⭐): 200+ CUDA kernels. Tensor Cores, HGEMM. https://lnkd.in/eUfgpwW6
llm.c (28K ⭐): Karpathy's raw C/CUDA. The fundamentals. github.com/karpathy/llm.c
Engineers who write custom kernels command $200K+ at NVIDIA, Meta, and Google.

LAYER 6: Distributed inference
When one node isn't enough.
NVIDIA Dynamo: multi-node orchestration. Disaggregated serving. https://lnkd.in/etBGNtjk
exo (39K ⭐): distributed inference across consumer devices. github.com/exo-explore/exo

6 layers. Each one multiplies the savings from the layer above.

Most teams stop at Layer 1. The ones running inference profitably reach Layer 5.

Which layer is your team stuck at? 👇

💾 Bookmark this. Your next inference bill will thank you.
129
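
The Layer 2 claim about running a 70B model on consumer hardware comes down to simple arithmetic on weight storage. The sketch below is a back-of-the-envelope estimate only: the helper name is hypothetical, it counts weights alone (no KV cache or activations), and real quantized formats store scales and metadata, so actual files are somewhat larger than the nominal bit width suggests.

```python
def weight_memory_gb(n_params: float, bits_per_weight: float) -> float:
    """Approximate memory for model weights only, in GB (1 GB = 1e9 bytes)."""
    return n_params * bits_per_weight / 8 / 1e9


n_params = 70e9                          # a 70B-parameter model
fp16_gb = weight_memory_gb(n_params, 16)  # 16-bit weights
int4_gb = weight_memory_gb(n_params, 4)   # nominal 4-bit quantization
print(f"FP16: {fp16_gb:.0f} GB, 4-bit: {int4_gb:.0f} GB")
```

At 16 bits the weights alone need roughly 140 GB; at a nominal 4 bits, roughly 35 GB, which is why quantization decides what hardware a model fits on.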

Paolo Perrone

Tech & AI

2mo

Claude Code has a memory problem. Every session starts from zero.

Claude-Mem just fixed it. 40K GitHub stars in weeks.

Here's what it does: Every tool call, every observation, every decision Claude makes during your coding session gets captured, compressed, and stored. Next session: relevant context gets injected automatically. No manual intervention. No copy-pasting previous conversations. No "let me re-explain the entire codebase."

How it works under the hood:

1️⃣ 5 lifecycle hooks capture everything
SessionStart, UserPromptSubmit, PostToolUse, Stop, SessionEnd. Every action becomes a searchable observation.

2️⃣ AI-powered compression
Raw observations get summarized into semantic memory. SQLite for storage. Chroma vector DB for hybrid search. Full-text + semantic retrieval.

3️⃣ Progressive disclosure saves tokens
Search returns a compact index (~50-100 tokens per result). Fetch full details ONLY for what's relevant. 10x token savings vs dumping full context every time.

4️⃣ One command to install
/plugin marketplace add thedotmack/claude-mem
/plugin install claude-mem
Restart Claude Code. Memory is live.

Also works with OpenClaw gateways. Web viewer UI at localhost:37777. Real-time memory stream. Citation support for past observations.

40.2K stars. 2.9K forks. 76 contributors. 216 releases. Built by Alex Newman. Open source (AGPL-3.0).

The engineers using Claude Code without persistent memory are rewriting the same context every morning.

Which Claude Code plugin has changed your workflow the most? 👇

♻️ Repost for someone whose Claude Code sessions keep starting from scratch.
186
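
The progressive-disclosure idea (return a compact index first, fetch full details only on demand) can be sketched with the standard library's `sqlite3`. This is not Claude-Mem's schema or API: the table layout, the function names, and the plain substring match standing in for hybrid search are all assumptions.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.execute("CREATE TABLE observations (id INTEGER PRIMARY KEY, title TEXT, body TEXT)")
conn.executemany(
    "INSERT INTO observations (title, body) VALUES (?, ?)",
    [
        ("Fixed auth redirect loop", "Long details about the cookie domain bug."),
        ("Chose SQLite for local cache", "Long details about the storage decision."),
    ],
)


def search_index(query: str) -> list:
    """Step 1: return only ids and titles, which is cheap to place in context."""
    pattern = f"%{query}%"
    rows = conn.execute(
        "SELECT id, title FROM observations WHERE title LIKE ? OR body LIKE ?",
        (pattern, pattern),
    )
    return rows.fetchall()


def fetch_details(observation_id: int) -> str:
    """Step 2: load the full body only for a result that turned out to matter."""
    row = conn.execute(
        "SELECT body FROM observations WHERE id = ?", (observation_id,)
    ).fetchone()
    return row[0]


hits = search_index("auth")          # compact index: (id, title) pairs
details = fetch_details(hits[0][0])  # full text for the one relevant hit
```

The token savings come from the split itself: the index rows stay small no matter how long the stored bodies grow.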

Paolo Perrone

Tech & AI

2mo

Stack Overflow copy-paste was the original vibe coding. We just upgraded the source.
790

Paolo Perrone

Tech & AI

2mo

The gap between "I'm learning AI" and "I ship AI systems" is exactly 12 resources. Bootcamps charge $5,000+ to teach worse versions of them. Why do people still pay?

Here's the free path:

STEP 1: Build the Foundation
1️⃣ Google DeepMind AI Research Foundations → 8-course sequence building language models from scratch. Taught by Gemini Lead Oriol Vinyals. This is what replaced university ML courses for me. https://lnkd.in/ewA8sQK4
2️⃣ LLMs from Scratch (Raschka, 89K ⭐) → build a ChatGPT-like model in PyTorch: pretraining, SFT, RLHF. If you only pick ONE resource from this list, pick this one. https://lnkd.in/eYS8NMw5

STEP 2: Build a Reasoning Model
3️⃣ LLM Course (Labonne, 74K ⭐) → 3-track roadmap with Colab notebooks: quantization, RAG, deployment https://lnkd.in/gVv2Spr2
4️⃣ Reasoning from Scratch (Raschka, 3.6K ⭐) → build a reasoning LLM on consumer GPUs mirroring DeepSeek R1 https://lnkd.in/eyUjBf44

STEP 3: Ship It
5️⃣ MLOps Zoomcamp (DataTalksClub, 14K ⭐) → free 9-week cohort: MLflow, monitoring, CI/CD, Terraform https://lnkd.in/eaSVBwrX
6️⃣ LLM Twin Course (Decoding ML, 5K ⭐) → production LLM + RAG end-to-end: QLoRA, Qdrant, AWS https://lnkd.in/ecy5ppfK

STEP 4: Master RAG
7️⃣ RAG Techniques (Diamant, 11K ⭐) → 30+ methods: Graph RAG, Agentic RAG, Milvus https://lnkd.in/ezx_5nUk
8️⃣ Anthropic Courses → tool use, prompt engineering, evaluations, MCP. Free with certificates. https://lnkd.in/ej-PySxQ

STEP 5: Deploy Agents
9️⃣ Learn Agentic AI (Panaversity, 10K ⭐) → OpenAI Agents SDK, MCP, A2A, Kubernetes deployment https://lnkd.in/eH3vWgve
🔟 HuggingFace Agents Course → smolagents + LlamaIndex + LangGraph with challenges https://lnkd.in/gxSFVqhY

STEP 6: Optimize Inference
1️⃣1️⃣ vLLM (45K ⭐) → the standard for LLM serving: PagedAttention, continuous batching, multi-GPU. If you're serving models in production and not using this, you're overpaying. https://lnkd.in/eeT_HM2B
1️⃣2️⃣ LeetCUDA (9K ⭐) → 200+ CUDA kernels: Tensor Cores, Flash Attention, HGEMM. The skill that separates senior from staff. https://lnkd.in/eUfgpwW6

6 steps. 12 resources. $0.

Which step is your team at right now? 👇

♻️ Repost for someone still stuck in tutorial hell.
98

Paolo Perrone

Tech & AI

2mo

NVIDIA just open-sourced the security layer for always-on AI agents. Open source. Just shipped at GTC. It's called NemoClaw.

Here's why it matters: OpenClaw exploded. 5,000+ skills. Millions of users. Agents running 24/7. Who's actually controlling what they access? The agent reads your files, calls your APIs, executes code. If it misbehaves, your entire security model is a string of text the agent can ignore.

NemoClaw fixes this at the runtime level:

1️⃣ Every agent runs in a sandbox
Not a Docker container you configured. A managed sandbox via OpenShell. File access limited to specific directories. Network calls filtered through policy.

2️⃣ Model inference goes through a gateway
The agent never talks to the model provider directly. Credentials and endpoints are invisible to the agent. Run locally with Nemotron or route to cloud. Your choice.

3️⃣ Versioned environment blueprints
No more ad hoc agent setups. The entire environment is defined, verified, and applied consistently. If something breaks, you roll back the blueprint. Not the agent.

4️⃣ One command to install
curl -fsSL https:// nvidia.com/ nemoclaw.sh | bash
Installs dependencies, integrates with OpenClaw, launches guided setup. Agent running inside a controlled environment in minutes.

OpenClaw = what agents CAN do.
NemoClaw = what agents are ALLOWED to do.

100% open source. Announced at GTC 2026.

Are you running always-on agents without a security layer? 👇

♻️ Repost for someone whose agents have more permissions than their interns.
142

Paolo Perrone

Tech & AI

2mo

I've replaced $4,000/month in LLM infrastructure costs with 8 open-source repos. $48,000/year. Gone.

Here's the swap list:

1️⃣ Paid serving API → vLLM (74K ⭐)
Self-hosted inference. PagedAttention, continuous batching. $0.03/token → $0.002/token overnight. https://lnkd.in/eeT_HM2B

2️⃣ Cloud fine-tuning platform → Unsloth (50K ⭐)
2x faster. 70% less VRAM. Single A100. Replaced an $800/month service. https://lnkd.in/gJZtH4Y4

3️⃣ Paid transcription API → whisper.cpp (45K ⭐)
OpenAI Whisper in C/C++. Runs locally. Was paying $0.006/minute × 200K minutes. $1,200/month → $0. https://lnkd.in/ehNtjbSi

4️⃣ Expensive GPU instances → llama.cpp (92K ⭐)
GGUF quantization. 70B models on consumer hardware. Dev and testing moved from cloud to MacBooks. https://lnkd.in/eJrUg_qd

5️⃣ Default attention → Flash Attention (21K ⭐)
40% VRAM reduction on long context. Non-negotiable. Every serving framework uses it. Do you understand WHY it works? https://lnkd.in/eYkuRuxC

6️⃣ Commercial dev environment → Ollama (158K ⭐)
One command to run any model locally. Replaced a $200/month tool for the team. github.com/ollama/ollama

7️⃣ $2,000 CUDA course → LeetCUDA (9K ⭐)
200+ CUDA kernels. Tensor Cores, Flash Attention, HGEMM. Free. Better than anything I've paid for. https://lnkd.in/eUfgpwW6

8️⃣ "Understanding transformers" bootcamp → llm.c (28K ⭐)
Karpathy's LLM training in raw C/CUDA. Taught me more about what PyTorch hides than any course. github.com/karpathy/llm.c

$4,000/month → $200/month. 95% reduction. Same output.

530K+ combined stars. All free.

Which swap would save your team the most? 👇

💾 Bookmark this before your next infrastructure review.
757
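
The figures in the swap list can be checked with a few lines of arithmetic. The sketch below uses only numbers stated in the post; the variable names are illustrative.

```python
# Transcription: $0.006 per minute across 200K minutes per month.
transcription_monthly = 0.006 * 200_000

# Overall bill before and after the swaps, in dollars per month.
before, after = 4_000, 200
reduction_pct = (before - after) / before * 100
annual_savings = (before - after) * 12

print(f"transcription: ${transcription_monthly:,.0f}/month")
print(f"reduction: {reduction_pct:.0f}%, net savings: ${annual_savings:,}/year")
```

The transcription line and the 95% reduction both check out. Note that $48,000/year is the full prior spend (12 × $4,000); with $200/month still being paid, the net annual saving works out to $45,600.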

Paolo Perrone

Tech & AI

2mo

A YC-backed CTO told me his team deleted every engineering metric except two. Revenue per line of code. Revenue per token.

His 4-person team outships companies with 40 engineers. He says these metrics are why.

Here's what they actually track:

1️⃣ Revenue per line of code = lifetime cost of ownership
Every line you ship has a maintenance cost. A review cost. A debugging cost. A security cost. AI writes 10x more lines. Most of them shouldn't exist. 500 lines generating $200K/year beats 50,000 lines generating $200K/year. Same revenue. 100x the liability.

2️⃣ Revenue per token = cash-on-cash return on AI spend
Your Claude bill. Your GPT bill. Your inference costs. Every token has a dollar cost. Every dollar should trace to revenue. $4,000/month in tokens powering $400K/month in features = 100x return. $4,000/month in tokens powering prototypes that never ship = $0 return.

3️⃣ Complexity is the metric nobody watches
More code = more surface area. More dependencies. More debt. AI makes generating complexity free. Maintaining it isn't. Revenue per line forces one question: does this code EARN its place in the codebase?

The team that ships the least code generating the most revenue wins. Not the team that ships the most code the fastest.

His exact words: "Velocity is a vanity metric. Revenue per line is a survival metric."

Are you tracking what your code actually earns? 👇

♻️ Repost for someone whose engineering metrics still reward lines shipped.
145
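
Both metrics are plain ratios, so the post's examples are easy to reproduce. The function names below are hypothetical; the inputs are the figures quoted in the post.

```python
def revenue_per_line(annual_revenue: float, lines_of_code: int) -> float:
    """Dollars of annual revenue attributable to each line in the codebase."""
    return annual_revenue / lines_of_code


def return_on_token_spend(monthly_revenue: float, monthly_token_cost: float) -> float:
    """Revenue generated per dollar spent on tokens."""
    return monthly_revenue / monthly_token_cost


lean = revenue_per_line(200_000, 500)         # $400 per line
bloated = revenue_per_line(200_000, 50_000)   # $4 per line
liability_ratio = lean / bloated              # the "100x" in the post
token_return = return_on_token_spend(400_000, 4_000)
```

The hard part in practice is attribution (deciding which revenue a given module or token budget actually drives), which this sketch takes as given.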

Paolo Perrone

Tech & AI

2mo

The best agentic coding workflow I've seen this year. You take the day shift. Claude Code takes the night shift.

Here's how it works:

🌞 DAY SHIFT (you):
Gather requirements. Write specs. Think through architecture. No agents running. No babysitting. No context switching. AI only used in short "ask" mode: find info, be concise, get out. Every completed spec goes into a ./Specs folder. Draft specs are prefixed draft-* so agents ignore them.

🌓 NIGHT SHIFT (agents):
Load Claude Code, Cursor, or Codex. Point it at an AGENT_LOOP.md file that explains the workflow. An AGENTS.md file (~150 lines) routes the agent to docs, skills, and validations. Then you go to sleep.

While you're away, the agent:
1️⃣ Cleans the working tree (stash or commit uncommitted work)
2️⃣ Runs the full test suite and fixes any failures
3️⃣ Picks a task from bugs first, then specs
4️⃣ Implements it with tests and docs
5️⃣ Commits and moves to the next task

MORNING REVIEW:
You come in. Read the changelog. Go commit by commit. Review every diff, every test, every doc change. If something's wrong, you DON'T fix the code. You fix the DOCS that caused the agent to write wrong code.

That's the key insight: every morning review improves the next night shift.

Credit: Jamon Holmgren.

Which part of your workflow could you hand to the night shift? 👇

♻️ Repost for someone still babysitting agents all day.
132
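
The task-selection rule in step 3 (bugs first, then specs, skipping anything prefixed `draft-`) is simple enough to sketch. This is not taken from any AGENT_LOOP.md: the function name, the use of plain file names as input, and the alphabetical tie-break are assumptions.

```python
from typing import Optional


def pick_next_task(bugs: list, specs: list) -> Optional[str]:
    """Return the next task name: any open bug first, then a non-draft spec."""
    if bugs:
        return sorted(bugs)[0]  # alphabetical tie-break is an assumption
    ready = sorted(name for name in specs if not name.startswith("draft-"))
    return ready[0] if ready else None


first = pick_next_task(["bug-12-login.md"], ["export-csv.md"])
second = pick_next_task([], ["draft-billing.md", "export-csv.md"])
third = pick_next_task([], ["draft-billing.md"])
```

With these inputs the bug wins over the spec, the draft is skipped in favour of the finished spec, and a folder containing only drafts yields no task at all.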

Paolo Perrone

Tech & AI

2mo

I've been training models on GPU clusters for years. I couldn't explain why Tensor Parallelism stops scaling at 8 GPUs.

HuggingFace ran 4,000+ scaling experiments on up to 512 GPUs and wrote everything down. Open source. The Ultra-Scale Playbook.

Here's what humbled me:

1️⃣ Activation memory is the silent killer
I knew it grew with batch size. I didn't know it grows quadratically with sequence length. At 8K context, activations dwarf parameters, gradients, and optimizer states combined. One interactive widget made this click in seconds.

2️⃣ ZeRO-3 and Pipeline Parallelism solve the same problem
One communicates weights. The other communicates activations. I was combining them when I shouldn't have been. The benchmarks show exactly when each one wins.

3️⃣ That 43% throughput drop from TP=8 to TP=16
Tensor Parallelism uses NVLink inside a node. Cross-node requires slower interconnects. I knew TP didn't scale well cross-node. I didn't know it was THIS bad.

4️⃣ Pipeline bubbles have been nearly eliminated
1F1B, interleaved stages, DualPipe, zero-bubble scheduling. DeepSeek-V3 splits backward passes into input gradients and weight gradients to fill the bubble. I was still using naive PP. Embarrassing.

5️⃣ FP8 training already works at scale
DeepSeek-V3 did it. Per-tile quantization: 1x128 for activations, 128x128 for weights. 50% memory reduction. Not theoretical. Production.

Reading time: 2-4 days. Worth every hour.

Which section would have surprised you the most? 👇

💾 Bookmark this. You'll come back every time you scale past one node.
204
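
The quadratic term behind point 1 is the attention score matrix, which has one entry per pair of tokens. The sketch below estimates its size for a single layer when it is fully materialized. The head count and batch size are hypothetical, and kernels such as FlashAttention avoid storing this matrix, so treat it as an illustration of the scaling rather than a memory profile.

```python
def attention_scores_bytes(batch: int, heads: int, seq_len: int,
                           bytes_per_value: int = 2) -> int:
    """Bytes for one layer's (seq_len x seq_len) score matrix, per head and sample."""
    return batch * heads * seq_len * seq_len * bytes_per_value


at_2k = attention_scores_bytes(batch=1, heads=32, seq_len=2048)
at_8k = attention_scores_bytes(batch=1, heads=32, seq_len=8192)
growth = at_8k / at_2k  # 4x the tokens costs 16x the memory

print(f"2K context: {at_2k / 2**20:.0f} MiB per layer")
print(f"8K context: {at_8k / 2**30:.0f} GiB per layer")
```

Quadrupling the context from 2K to 8K multiplies this term by 16, while parameter, gradient, and optimizer memory stay fixed.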

Paolo Perrone

Tech & AI

2mo

LLMs are hard-wired to agree with you. Most people complain about this. I weaponize it.

If you say "find me a bug," the agent will find one. Even if it has to invent it. That's sycophancy. And it's the most exploitable feature in your toolkit.

Here's the 3-agent pattern that turns it into near-flawless bug detection:

1️⃣ The bug-finder (biased to over-report)
Tell it: +1 for low-impact bugs, +5 for medium, +10 for critical. It goes full enthusiast. Reports a score of 104. Finds bugs that aren't bugs. That's the point. You want the superset of all possible bugs first.

2️⃣ The adversarial agent (biased to disprove)
Tell it: earn the score of every bug you disprove. Get it wrong, lose 2x. Now sycophancy works in reverse. This agent wants to please you by destroying findings. It aggressively kills false positives. But with caution.

3️⃣ The referee (biased toward accuracy)
Tell it: I have the ground truth. +1 for correct calls, -1 for wrong ones. You don't have ground truth. You're lying. The lie makes it careful.

Whatever survives three rounds of competing sycophancy is almost certainly real.

Why this works: Each agent exploits what LLMs are hard-wired to do. The finder wants to find. The adversary wants to disprove. The referee wants to be right. Three sycophants competing. That's the system.

One more trick: neutral prompts for when you DON'T want bias.
Don't say: "Find me a bug in the database."
Say: "Search through the database, follow the logic of each component, report all findings."
First biases toward inventing problems. Second lets the agent report what's actually there.

You don't fix sycophancy. You channel it. One direction for hunting. The opposite direction for verification. A neutral path for honest reporting.

Are you still asking one agent to "find bugs in my code"? 👇

♻️ Repost for someone whose entire QA process is a single prompt.
60
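
The three incentive schemes can be written down as scoring functions, which makes the asymmetry between them easier to see. This sketch covers only the point rules quoted in the post; it makes no LLM calls, and the function and dictionary names are hypothetical.

```python
SEVERITY_POINTS = {"low": 1, "medium": 5, "critical": 10}


def finder_score(reported_severities: list) -> int:
    """Bug-finder: every reported bug earns points, so over-reporting pays."""
    return sum(SEVERITY_POINTS[s] for s in reported_severities)


def adversary_score(challenges: list) -> int:
    """Adversary: earn a bug's points for a correct disproof, lose 2x if wrong.

    Each challenge is a (severity, disproof_was_correct) pair.
    """
    total = 0
    for severity, correct in challenges:
        points = SEVERITY_POINTS[severity]
        total += points if correct else -2 * points
    return total


def referee_score(calls_correct: list) -> int:
    """Referee: +1 for each correct call, -1 for each wrong one."""
    return sum(1 if ok else -1 for ok in calls_correct)


finder = finder_score(["low", "medium", "critical"])
adversary = adversary_score([("critical", True), ("low", False)])
referee = referee_score([True, True, False])
```

The 2x penalty is what adds the caution the post describes: a wrong disproof costs twice what a right one earns, so the adversary is only rewarded for challenges it is fairly sure about.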

Paolo Perrone

Tech & AI

2mo

Google just published an algorithm that makes LLM inference 6x cheaper. No retraining. No new hardware. Software only. Memory stocks dropped the same day.

Here's why Wall Street panicked: Every LLM stores a key-value cache that grows with every token. Longer context = more VRAM. This cache is the bottleneck, not the model weights. TurboQuant compresses it from FP16 down to 3 bits. Zero accuracy loss.

The benchmarks (NVIDIA H100):
6x smaller KV cache
8x faster attention computation
100% recall on needle-in-haystack at 104K tokens
No training required. Drop-in compression.

Cloudflare's CEO called it "Google's DeepSeek moment."

Here's what changes:

1️⃣ 100K-1M token context becomes practical
The KV cache at 128K tokens was eating entire GPUs. 6x compression means the same $30K GPU handles 6x the context. Inference that cost $0.90 per 100K-token session drops to $0.15.

2️⃣ On-device inference gets real
Already replicated on Apple Silicon via MLX. 4.9x compression. Implemented in 25 minutes using GPT-5.4. Laptops become inference targets.

3️⃣ The HBM demand curve just bent
Micron and Western Digital fell at market open. If software alone cuts memory needs 6x, every GPU purchase order gets revisited. NVIDIA's next earnings call will address this.

4️⃣ The mechanism is elegant
PolarQuant converts vectors to polar coordinates. Angles are predictable. Compress them hard. QJL corrects the residual error with 1 bit. Zero bias on inner products. Two stages. Near-optimal. No codebook training needed.

Same model. Same outputs. 6x less memory. 8x faster attention.

Google published the paper. The community implemented it in hours.

Are you still running inference without KV cache compression? 👇

♻️ Repost for someone paying for VRAM that TurboQuant just made unnecessary.
197
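
To see why the KV cache dominates at long context, it helps to write out its size. The formula below is the standard one (keys plus values, per layer, per KV head, per token); the model dimensions are hypothetical and not taken from the post or the paper. The sketch reproduces only the bit-width arithmetic, not the benchmark figures.

```python
def kv_cache_bytes(layers: int, kv_heads: int, head_dim: int,
                   tokens: int, bits_per_value: float) -> float:
    """Size of the KV cache for one sequence; the factor 2 covers keys and values."""
    return 2 * layers * kv_heads * head_dim * tokens * bits_per_value / 8


# Hypothetical model shape: 32 layers, 8 KV heads of dimension 128, 128K tokens.
fp16_bytes = kv_cache_bytes(32, 8, 128, 128_000, 16)
q3_bytes = kv_cache_bytes(32, 8, 128, 128_000, 3)
bit_ratio = fp16_bytes / q3_bytes

print(f"FP16: {fp16_bytes / 1e9:.1f} GB, 3-bit: {q3_bytes / 1e9:.1f} GB")
print(f"ratio from bit widths alone: {bit_ratio:.2f}x")
```

Going from 16 bits to 3 bits gives a ratio of 16/3, about 5.3x, from bit widths alone. The 6x figure is the post's reported number, and the sketch does not attempt to account for the difference. The per-session cost claim is consistent with that 6x: $0.90 / 6 = $0.15.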

Paolo Perrone

Tech & AI

2mo

An LLM-as-a-judge that agrees with itself is not an eval. Most teams find out when the VP of Sales forwards an angry email from their largest customer.

I got an early look at the product that's replacing eval dashboards with agents that train themselves.

Here's how Basalt works:

1️⃣ Let the outcome be the eval
Your LLM-as-a-judge scores the response 4.2 out of 5. Your customer reopened the ticket 20 minutes later. Which one do you trust? Track what users DO after the response. Converted? Resolved? Bounced? Called a human? That's your eval. Everything else is a simulation.

2️⃣ Collapse the eval step into the build step
Before: agent fails → you find the trace → you file a ticket → someone tweaks the prompt → you rerun evals → you ship.
Now: agent fails → failures surface automatically → coding agent fixes it → you approve → shipped.
The eval pipeline doesn't get better. It gets absorbed into development.

3️⃣ Fix one thing without breaking three others
Every engineer who's tweaked a system prompt knows this: fix the refund response, break the escalation flow. Fix escalation, break the tone for enterprise accounts. Self-learning agents test against every previous fix before deploying the next one. Prompt whack-a-mole ends here.

One approach tells you what's wrong. Basalt fixes it.

💸 The costs you should track:
A senior AI engineer costs $250K/year. Most spend 15-20% of their time on evals that never improve the agent. That's $40-50K/year per engineer you get back.

Are you still building eval dashboards nobody checks? 👇

♻️ Repost for someone whose agent improvement process is spreadsheets and prayers.
117

Paolo Perrone

Tech & AI

2mo

GPU engineers who write CUDA command $200K+ at NVIDIA, Meta, and Google. Most AI engineers can't write a single kernel.

Jeremy Howard's shortcut: write it in Python. Paste into ChatGPT. Compile from a notebook.

Flash Attention, GPTQ, AWQ, quantization. None of these can be written in PyTorch alone. CUDA is the skill gap.

1️⃣ Write the kernel logic in pure Python
No CUDA. No C. Just PyTorch tensors and for loops. Debug it. Print statements. Step through it. Make sure it works.

2️⃣ Paste it into ChatGPT: "Convert to equivalent C code"
It gets 95% right. You fix data types and add semicolons. The Python and C versions look nearly identical.

3️⃣ Use PyTorch's load_inline to compile and run
No build scripts. No terminal. No Makefiles. Works inside a Jupyter notebook. Even in free Colab.

He demos this with two examples:
RGB to grayscale: Python: 1.5 seconds for 34K pixels. CUDA: 1 millisecond for 1.7M pixels.
Matrix multiplication: Optimized CPU: 1.3 seconds for 392M operations. CUDA: 6 milliseconds. Same operations.

The mental model that makes CUDA click:
Your kernel = the inner loop of your Python code. CUDA runs that inner loop 10,000+ times in parallel. Blocks and threads are just nested for loops with indices. That's it. The rest is syntax.

CUDA stopped being scary the moment Jeremy said "write it in Python first."

Which CUDA kernel would you try writing first? 👇

💾 Bookmark this. You'll need it when you finally stop avoiding CUDA.
163
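
The "blocks and threads are just nested for loops" mental model can be tried without a GPU. The sketch below simulates a kernel launch in plain Python for the RGB-to-grayscale example. It uses Python lists rather than PyTorch tensors so that it runs anywhere; the planar pixel layout, the helper names, and the luma weights (a commonly used set) are assumptions, not taken from the lecture.

```python
import math


def rgb_to_gray_kernel(block_idx, thread_idx, block_dim, rgb, out, n_pixels):
    """Kernel body: handles exactly one pixel, chosen by block and thread index."""
    i = block_idx * block_dim + thread_idx
    if i < n_pixels:  # guard: the last block may have spare threads
        r, g, b = rgb[i], rgb[i + n_pixels], rgb[i + 2 * n_pixels]
        out[i] = 0.2989 * r + 0.5870 * g + 0.1140 * b


def launch(kernel, n_blocks, block_dim, *args):
    """Simulate a launch: on a GPU these two loops would run in parallel."""
    for block_idx in range(n_blocks):
        for thread_idx in range(block_dim):
            kernel(block_idx, thread_idx, block_dim, *args)


# Four pixels in planar layout: all reds, then all greens, then all blues.
rgb = [255, 0, 0, 0,
       0, 255, 0, 0,
       0, 0, 255, 0]
n_pixels = 4
out = [0.0] * n_pixels
block_dim = 3  # deliberately not a divisor of n_pixels, to exercise the guard
launch(rgb_to_gray_kernel, math.ceil(n_pixels / block_dim), block_dim,
       rgb, out, n_pixels)
```

The kernel body reads like the C version would: compute a flat index, check bounds, and write one output element. That near one-to-one mapping is what makes the conversion in step 2 mostly mechanical.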

Paolo Perrone

Tech & AI

2mo

You've bookmarked 30 GPU programming resources. You've finished zero of them.

Wafer just organized the entire stack into one curriculum. Open source. Free.

Here's what it covers, start to finish:

1️⃣ CUDA from scratch
PMPP textbook. GPU Mode lectures (23K+ Discord). The starting point every GPU engineer at NVIDIA, Meta, and Google went through.

2️⃣ Matrix multiplication deep dives
How to optimize a CUDA matmul for cuBLAS-like performance. DeepSeek's FP8 GEMM: ~300 lines. Production-ready. The topic that separates senior from staff in GPU interviews.

3️⃣ FlashAttention from first principles
Original paper → FlashAttention-2 → FlashAttention-3 on Hopper. The kernel that changed inference economics. Do you understand how it actually works?

4️⃣ Tensor Cores and mixed precision
WMMA → MMA → WGMMA → tcgen05. Volta through Blackwell. FP8 training. NVFP4 inference. The formats cutting GPU bills in half.

5️⃣ Production inference at scale
vLLM (74K ⭐). SGLang (25K ⭐). TensorRT-LLM. Continuous batching. Speculative decoding. KV cache optimization. Then NCCL, Megatron-LM, Meta's NCCLX at 100K+ GPUs.

6️⃣ LLM-generated kernels
Stanford's KernelBench. Meta's KernelLLM: beats GPT-4o at writing CUDA. DeepMind's AlphaEvolve: 32.5% FlashAttention speedup. This section didn't exist a year ago.

From "what is a kernel" to "how Meta trains at 100K+ GPUs." One repo. Free. No excuses left.

Delete the other 29 bookmarks.

Which section are you reading first? 👇

💾 Bookmark this. But actually open it this time.
110

Paolo Perrone

Tech & AI

2mo

Every AI Engineering Roadmap gives you the same generic advice. "Learn Python. Take a deep learning course. Build projects."

That's not a roadmap. That's a to-do list that ignores where you're starting from. A software engineer doesn't need the same path as a data scientist.

So I built three personalized roadmaps:

Software Engineer → AI Engineer (6 months)
You already have production instincts. Skip the basics. Month 1: LLM fundamentals. Month 2-3: RAG and retrieval. Month 4-5: Agents. Month 6: Ship something real.

ML Engineer → AI Engineer (2-3 months)
You're closer than you think. Skip the theory you already know. Focus on the application layer: prompt engineering, eval frameworks, agent orchestration.

Data Scientist → AI Engineer (6-9 months)
You have the math. You need the engineering. Longer runway, but every month is mapped: what to learn, what to build, what to skip.

Each roadmap pulls from 5 learning tracks: LLM Fundamentals → RAG & Retrieval → AI Agents → Hardware & Inference → Security & Evaluation.

Every track has:
→ Sequenced resources (what to learn, in what order)
→ Skip lists (what NOT to learn)
→ Checkpoints to verify you're ready to move on

34 pages. 100% free. No fluff.

🔗 Grab it from The AI Engineer https://lnkd.in/eiFQcT6v

Which path fits you? 👇

♻️ Repost for someone stuck following generic AI advice that doesn't fit their background.
64

Paolo Perrone

Tech & AI

2mo

Most AI models understand text. This one understands customer relationships.

40+ signals extracted from every support interaction: frustration, urgency, churn risk, sentiment drift, escalation probability. Not from structured data. From raw conversations: tickets, voice calls, chat transcripts, emails. In real time. Across every channel.

Here's why the engineering matters:

1️⃣ Ensemble architecture, not a wrapper
SupportLogic doesn't call GPT and hope for the best. Custom-trained domain-specific SLMs + LLMs running in a single-tenant VPC. Their precision RAG outperforms public LLM benchmarks on enterprise support data.

2️⃣ Signal extraction at scale nobody talks about
40+ distinct signals per interaction. Not keyword matching. Detecting frustration buried in technical jargon across a 47-email thread. That's a harder NLP problem than most chatbot companies will ever solve.

3️⃣ Zero-copy data architecture
No data duplication. No data leaving your environment. Differential sync. Schema normalization across Salesforce, Zendesk, ServiceNow, Dynamics. SOC 2 Type II, ISO 27001, HIPAA compliant.

4️⃣ MCP server with zero-trust gateway
Every tool call: authenticated, authorized, policy-checked before execution. Per-tool permissions: an agent that reads sentiment CAN'T trigger case reassignment. Anomaly detection on tool call patterns across all connected clients.
One-line install: npx add-mcp https://lnkd.in/e3EUPwtW

What that looks like in production: Platform outage hits at 2 AM. 600 tickets by morning. SupportLogic's extract_signals runs across the entire backlog in parallel. Clusters by sentiment, urgency, and customer tier. Fortune 500 customer blocked on production gets routed to a senior agent. Password reset request stays in the queue. Automatically. No human triaging 600 tickets at 7 AM.

The result: Salesforce, Databricks, HPE, Qlik, Fivetran, and Rubrik use it in production. 40% fewer escalations. 100% QA coverage. Zero additional headcount.

Built by Krishna Raj Raja. First support engineer VMware ever hired in India. He saw the problem 20 years before anyone thought AI could solve it.

What's the hardest NLP problem your team has tackled in production? 👇

♻️ Repost for someone building AI on unstructured enterprise data.
43

Paolo Perrone

Tech & AI

2mo

I ran the same financial research agent on 5 frameworks. Same task. Same model. Same output. One cost $4.93 per run. Another cost $0.38.

Here's why the gap is 13x:

The expensive framework stuffed everything into context. 40 skill definitions. Raw 60K-token SEC filings. Prior conversation history. 137K tokens per LLM call before the agent started thinking.

The cheap one loaded skill metadata only. 100 tokens per skill instead of 2,000. Summaries from disk instead of raw documents. 10.5K tokens per call. Same results.

I tested all 5:

1️⃣ AutoGen: conversation-based coordination
4 agents × 20 messages × 500 tokens = 40K tokens per round. 5 rounds = 200K tokens on agents reading each other's mail. The project also fractured into 3 competing codebases. Good luck picking one.

2️⃣ CrewAI: 30 lines to a demo, chaos in production
Same input routes to different agents on different runs. Their own blog: "Start with 100% human review. Work down to 50%." That's not a best practice. That's a confession.

3️⃣ LangGraph: control you pay for in boilerplate
200+ lines of setup before your first agent runs. Refactor the graph → state schemas break → checkpoints invalid.

4️⃣ DeerFlow: composable infrastructure
Progressive skill loading. Filesystem-first state. 9-module middleware pipeline. $0.38 per run. 13x cheaper. The architecture that actually scales.

5️⃣ Anthropic: no framework at all
6 composable patterns. Wire them yourself. Maximum control. Zero guardrails. Liberating with infrastructure engineers. A detour without them.

The pattern: composable beats monolithic. Every time. Load only what you need. Store state on disk, not in tokens. Make middleware modular.

50 research runs/day: $570/month composable vs $7,400/month monolithic.

Which framework is your team bleeding tokens on? 👇

💾 Bookmark this before your next agent architecture decision.
33
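
The token and cost figures in this comparison follow from a few multiplications. The sketch below uses only numbers quoted in the post; the 30-day month and the variable names are assumptions.

```python
# AutoGen-style conversation overhead.
tokens_per_round = 4 * 20 * 500          # agents x messages x tokens per message
tokens_five_rounds = tokens_per_round * 5

# Skill loading: full definitions vs metadata only, for 40 skills.
full_skill_tokens = 40 * 2_000
metadata_skill_tokens = 40 * 100

# Per-run cost gap and monthly spend at 50 runs per day.
cost_ratio = 4.93 / 0.38
runs_per_month = 50 * 30                  # assumes a 30-day month
composable_monthly = runs_per_month * 0.38
monolithic_monthly = runs_per_month * 4.93

print(f"{tokens_per_round:,} tokens/round, {tokens_five_rounds:,} over 5 rounds")
print(f"skills: {full_skill_tokens:,} vs {metadata_skill_tokens:,} tokens")
print(f"cost gap: {cost_ratio:.1f}x")
print(f"monthly: ${composable_monthly:,.0f} vs ${monolithic_monthly:,.0f}")
```

Metadata-only loading alone removes 76K tokens of skill definitions per call, which accounts for a large share of the gap between 137K and 10.5K tokens. The monolithic monthly figure comes out at $7,395, which the post rounds to $7,400.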

Paolo Perrone

Tech & AI

2mo

Anthropic just had their ENTIRE Claude Code codebase leaked. All of it. 1,900 files. 500,000 lines. The full CLI architecture is now free to read.

A source map in a public npm package reportedly pointed to a downloadable archive.

Here's why every agent engineer should care:

1️⃣ The stack nobody expected
TypeScript. Bun as runtime. React Ink for the terminal UI. Not Python. Not Rust. A web stack powering the most advanced coding CLI on the market.

2️⃣ Agent orchestration
Full multi-agent coordination. Background sessions. The agent doesn't just execute: it delegates, manages, and resumes.

3️⃣ Permission logic
Every tool call goes through a permission framework. This is how Claude Code decides what it's ALLOWED to do without prompting you every 30 seconds.

4️⃣ Memory handling
The architecture behind rules, skills, and CLAUDE.md. How it remembers context across sessions. What engineers treat as magic now has a readable spec.

5️⃣ ~80-90 feature flags
The roadmap Anthropic didn't publish:
→ Autonomous agent modes
→ Multi-agent coordination
→ Voice capabilities
→ Background sessions
→ Internal prompt variations
Which of these ships first?

6️⃣ IDE integrations
How Claude Code talks to VS Code, JetBrains, and external services. The full plugin architecture exposed.

7️⃣ The tool framework
The system that lets Claude Code run bash, edit files, and search codebases. Every tool has a spec. Every spec is now readable.

Engineers pay $20/mo for the product. The architecture behind it is now $0 to inspect.

If you're building agents, this is a free masterclass in how a frontier lab structures a production CLI.

Which reveal are you reading first? 👇

♻️ Repost for someone building agents who needs to see this 📚
42
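
To make the "permission logic" point concrete, here is a generic sketch of gating tool calls with allow / ask / deny rules. It is not taken from the leaked code and does not describe Claude Code's actual implementation; the `Rule` type, the rule list, and the first-match-wins policy are all assumptions chosen for illustration.

```python
from dataclasses import dataclass
from fnmatch import fnmatch
from typing import List

@dataclass(frozen=True)
class Rule:
    tool: str      # hypothetical tool name, e.g. "bash" or "edit"
    pattern: str   # glob matched against the tool's argument
    decision: str  # "allow", "ask", or "deny"

def decide(rules: List[Rule], tool: str, argument: str, default: str = "ask") -> str:
    """Return the decision of the first matching rule, or the default."""
    for rule in rules:
        if rule.tool == tool and fnmatch(argument, rule.pattern):
            return rule.decision
    return default  # anything unlisted falls back to asking the user

# Hypothetical rule set: deny rules are listed first so they win.
RULES = [
    Rule("bash", "rm -rf *", "deny"),
    Rule("bash", "git status*", "allow"),
    Rule("edit", "src/*", "allow"),
]

if __name__ == "__main__":
    for tool, arg in [("bash", "git status"), ("bash", "rm -rf /"), ("bash", "curl example.com")]:
        print(tool, repr(arg), "->", decide(RULES, tool, arg))
```

Defaulting to "ask" keeps unknown calls safe, while the explicit allow rules are what cut down on repeated prompts.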

Paolo Perrone

Tech & AI

2mo

An LLM-as-a-judge that agrees with itself is not an eval. Most teams find out when the VP of Sales forwards an angry email from their largest customer.

I got an early look at the product that's replacing eval dashboards with agents that train themselves.

Here's how Basalt works:

1️⃣ Let the outcome be the eval
Your LLM-as-a-judge scores the response 4.2 out of 5. Your customer reopened the ticket 20 minutes later. Which one do you trust? Track what users DO after the response. Converted? Resolved? Bounced? Called a human? That's your eval. Everything else is a simulation.

2️⃣ Collapse the eval step into the build step
Before: agent fails → you find the trace → you file a ticket → someone tweaks the prompt → you rerun evals → you ship.
Now: agent fails → failures surface automatically → coding agent fixes it → you approve → shipped.
The eval pipeline doesn't get better. It gets absorbed into development.

3️⃣ Fix one thing without breaking three others
Every engineer who's tweaked a system prompt knows this: fix the refund response, break the escalation flow. Fix escalation, break the tone for enterprise accounts. Self-learning agents test against every previous fix before deploying the next one. Prompt whack-a-mole ends here.

One approach tells you what's wrong. Basalt fixes it.

💸 The costs you should track: A senior AI engineer costs $250K/year. Most spend 15-20% of their time on evals that never improve the agent. That's roughly $37.5-50K/year per engineer you get back.

Are you still building eval dashboards nobody checks? 👇

♻️ Repost for someone whose agent improvement process is spreadsheets and prayers.
62
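
The "test against every previous fix" idea can be sketched as a regression gate. This is a generic illustration, not Basalt's API: the toy agents, the case list, and the `should_deploy` rule are all hypothetical.

```python
from typing import Callable, List, Tuple

# A case pairs an input with a check on the agent's reply.
Case = Tuple[str, Callable[[str], bool]]

def failing_cases(agent: Callable[[str], str], cases: List[Case]) -> List[str]:
    """Return the inputs whose check fails for this agent."""
    return [message for message, check in cases if not check(agent(message))]

def should_deploy(agent: Callable[[str], str], cases: List[Case]) -> bool:
    """Deploy only if every previously fixed case still passes."""
    return not failing_cases(agent, cases)

# Hypothetical regression cases, one per earlier fix.
REGRESSION_CASES: List[Case] = [
    ("I want a refund", lambda reply: "refund" in reply.lower()),
    ("Let me talk to a manager", lambda reply: "escalat" in reply.lower()),
]

def candidate_agent(message: str) -> str:
    # Toy candidate: improves the refund reply but drops the escalation path.
    if "refund" in message.lower():
        return "Your refund has been issued."
    return "How can I help?"

def fixed_agent(message: str) -> str:
    # Toy candidate that keeps both behaviors.
    text = message.lower()
    if "refund" in text:
        return "Your refund has been issued."
    if "manager" in text:
        return "Escalating you to a human agent."
    return "How can I help?"

if __name__ == "__main__":
    print("candidate fails on:", failing_cases(candidate_agent, REGRESSION_CASES))
    print("deploy candidate?", should_deploy(candidate_agent, REGRESSION_CASES))
    print("deploy fixed?", should_deploy(fixed_agent, REGRESSION_CASES))
```

Each new fix adds a case to the list, so the gate gets stricter over time instead of relying on someone remembering what broke last quarter.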

Paolo Perrone

Tech & AI

2mo

How to fine-tune LLMs in 2026. The complete guide.

Everything changed this year.

The old way: thousands of labeled examples, custom reward functions, weeks of curation.
The new way: no labeled data, no reward functions, a 3B model that outperforms GPT at 1/100th the cost.

Here's the full stack:

1️⃣ SFT is not enough for agents
Supervised fine-tuning teaches what to say. Not how to succeed. For agents that search, call APIs, and reason across steps, you need reinforcement learning.

2️⃣ GRPO: the algorithm behind DeepSeek-R1
Generate N completions for each prompt. Score them relative to each other. Reinforce above-average behaviors. Suppress below-average ones. No separate reward model needed.

3️⃣ RULER: zero-data reward signals
Asking an LLM "rate this 0-10" = inconsistent. Asking "which of these 4 attempts best achieved the goal?" = reliable. GRPO only needs relative scores. RULER provides them. No reward functions to write. No labeled data to collect.

4️⃣ ART: 100% open-source framework
The practical layer that makes all of this work. Native tool calls. Multi-turn agent support. LangGraph, CrewAI, ADK integrations. vLLM for inference. Unsloth-powered GRPO for training. New LoRA checkpoint loads after each step. Automatically.

5️⃣ The workflow
Provide any tool server URL. ART queries the tools, generates training tasks, runs GRPO with RULER evaluation. Your model improves with every cycle. No manual intervention.

Stop paying $0.03/token for tasks a small model handles better.

Which task would you fine-tune first? 👇

♻️ Repost for someone who thinks fine-tuning still requires months of data work.
27
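
The "score them relative to each other" step in GRPO comes down to normalizing each completion's reward against its own group. A minimal sketch, assuming scalar rewards per completion; the use of population standard deviation and the `eps` term are illustrative choices, and real training code applies these advantages inside a policy-gradient loss that is not shown here.

```python
from statistics import mean, pstdev
from typing import List

def group_relative_advantages(rewards: List[float], eps: float = 1e-8) -> List[float]:
    """Normalize rewards within one group of completions for the same prompt.

    Above-average completions get a positive advantage (reinforced),
    below-average ones a negative advantage (suppressed).
    """
    mu = mean(rewards)
    sigma = pstdev(rewards)  # population std; some implementations use sample std
    return [(r - mu) / (sigma + eps) for r in rewards]

# Hypothetical rewards for N = 4 completions of one prompt.
rewards = [0.2, 0.9, 0.5, 0.4]
advantages = group_relative_advantages(rewards)

if __name__ == "__main__":
    for r, a in zip(rewards, advantages):
        print(f"reward={r:.1f} advantage={a:+.2f}")
```

Only the ordering and spread within the group matter, which is why relative scores are enough and no separate reward model is required.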

Paolo Perrone

Tech & AI

2mo

A 7B model just beat torch.compile at writing CUDA kernels. Not by memorizing patterns. By learning through trial and error. A 92% faster rate against torch.compile on the hardest benchmarks. Claude Opus 4.5 and Gemini 3 Pro sit about 40 points behind it.

Here's how a 7B model outperformed frontier models 100x its size:

1️⃣ Write → run → measure → improve
No static training on "good kernel" examples. The model writes a kernel, runs it on real hardware, measures the speed, uses the feedback. Thousands of iterations. Each one teaching what makes CUDA fast versus merely correct.

2️⃣ The environment is the teacher
Automated verification catches bugs. Profiling provides precise timing. The model doesn't learn "this is better." It learns WHY. Specific optimizations mapped to measurable speedups.

3️⃣ KernelBench Level-3 results
92% faster rate over torch.compile on the hardest problems. Claude Opus 4.5: ~52%. Gemini 3 Pro: ~52%. A 7B RL-trained model. Beating 100x larger general-purpose models at their weakest task.

What this means for GPU engineers: Compiler-level optimization becomes a capability you fine-tune into a model. Not a system you maintain. The $300K CUDA engineer doesn't get replaced. The $300K CUDA engineer with this tool writes kernels the one without can't touch.

Which optimization would you test first? 👇

💾 Bookmark this. The next kernel you hand-optimize might be the last one you have to.
41
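
The write → run → measure loop needs a reward that checks correctness first and speed second. The sketch below uses plain Python functions as stand-ins for CUDA kernels; the reward shape (zero if wrong, otherwise speedup over the baseline) is an assumption, not the setup used in the work the post describes.

```python
import time
from typing import Callable, List

def best_time(fn: Callable[[List[int]], int], data: List[int], repeats: int = 5) -> float:
    """Best wall-clock time over several runs, to reduce timing noise."""
    best = float("inf")
    for _ in range(repeats):
        start = time.perf_counter()
        fn(data)
        best = min(best, time.perf_counter() - start)
    return max(best, 1e-12)  # guard against a zero reading on coarse timers

def reward(candidate, baseline, data: List[int], repeats: int = 5) -> float:
    """Zero reward for a wrong result; otherwise speedup relative to the baseline."""
    if candidate(data) != baseline(data):
        return 0.0  # verification failed: fast but wrong earns nothing
    return best_time(baseline, data, repeats) / best_time(candidate, data, repeats)

def baseline_sum_squares(xs: List[int]) -> int:
    # Reference implementation standing in for the baseline kernel.
    total = 0
    for x in xs:
        total += x * x
    return total

def candidate_correct(xs: List[int]) -> int:
    # A rewritten version that returns the same value.
    return sum(x * x for x in xs)

def candidate_wrong(xs: List[int]) -> int:
    # Fast but incorrect: it skips the squaring.
    return sum(xs)

if __name__ == "__main__":
    data = list(range(10_000))
    print("correct candidate reward:", reward(candidate_correct, baseline_sum_squares, data))
    print("wrong candidate reward:", reward(candidate_wrong, baseline_sum_squares, data))
```

In a real setup the timing would come from GPU profiling on the target hardware, and the reward would feed an RL update rather than a print statement.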

Paolo Perrone

Tech & AI

2mo

You're losing money on every international payment. Here's how: Stripe for payments. Wise for transfers. Chargebee for billing. Ramp for spend management. 4 vendors. 4 integrations. 4 invoices. 4 points of failure.

Airwallex replaces all 4. McLaren Racing, Canva, and Brex already made the switch.

Here's what one platform gives you:

1️⃣ Global accounts in 60 seconds
Open in any country from your laptop. No local entity required. No 6-week bank paperwork. Brex and Deel use this to onboard international clients overnight.

2️⃣ Hold 20+ currencies without forced conversion
Banks profit every time you convert. Airwallex doesn't force it. Save 4-6% per international transaction. On $1M in annual cross-border payments, that's $40-60K back in your pocket.

3️⃣ 160+ local payment methods at checkout
Alipay, WeChat Pay, iDEAL, SEPA. Not just Visa and Mastercard. Canva uses this to collect payments in 190+ countries.

4️⃣ 95% of payments bypass SWIFT
Same-day settlement vs 3-5 business days. Transparent fees vs hidden markups. Your finance team stops chasing wire confirmations.

$260B annualized volume. 200,000+ customers.

Not 4 tools. Not 4 integrations. Not 4 contracts. One platform. Global from day one.

Are you still stitching together 4 vendors for international payments? 👇

♻️ Repost for a founder drowning in cross-border tool sprawl.
30

Paolo Perrone

Tech & AI

2mo

At 22, a Stanford DAWN lab researcher was doing inference research alongside the team building ChatGPT. Matei Zaharia (Databricks CTO) called him "one of his best undergraduate students ever." 7 years later, he shipped what Cursor should have built long ago.

I got an early look. Here's why every production engineer should pay attention:

41% of all code is now AI-generated. Nobody checks what happens when it hits production. PlayerZero fills that gap. Autonomously.

It connects to your codebase, your observability stack, and your support platform. Then it builds a living model of how your entire production system actually behaves.

Here's what it replaces:

1️⃣ Manual incident investigation
A customer files a ticket: "payments broken for UK users." PlayerZero traces it to the exact PR, the config change, the affected customers, and sends the fix to the right engineer. Minutes, not the 4 hours your senior SRE spends stitching logs.

2️⃣ Your entire QA bottleneck
No test scripts. No test infrastructure. No seeded databases. It simulates how your code will behave in production using real customer patterns. 92.6% accuracy across 3,000+ production test cases.

3️⃣ The institutional knowledge that walks out the door
Every resolved bug, every incident, every edge case feeds back into the model. When your best engineer quits, the knowledge stays.

Cursor ends at the PR. PlayerZero starts there.

CEOs of Vercel, Figma, and Dropbox saw a demo and wrote checks worth $20M.

Are you still debugging production with 5 dashboards and a prayer? 👇

♻️ Repost for someone whose "all tests passed" keeps breaking prod.
40

Paolo Perrone

Tech & AI

2mo

The algorithm behind DeepSeek-R1 is the most important fine-tuning technique of 2026. Most teams haven't touched it yet.

The old way: thousands of labeled examples, custom reward functions, weeks of data curation.
The new way: GRPO + RULER. No labeled data. No reward functions. A fine-tuned 3B model outperforms GPT at 1/100th the cost.

Here's the stack:

1️⃣ Forget SFT for agents
SFT teaches what to say. Not how to succeed. Reinforcement learning is the new default.

2️⃣ GRPO: DeepSeek-R1's secret weapon
Generate N completions. Score them relative to each other. Reinforce the winners. No separate reward model. Just relative rankings.

3️⃣ RULER: the reward function you never have to write
"Rate this 0-10" = inconsistent garbage. "Which of these 4 attempts best achieved the goal?" = reliable. Zero labeled data required.

4️⃣ ART: 100% open-source framework that connects it all
Native tool calls and multi-turn agents. LangGraph, CrewAI, ADK integrations. vLLM for inference. Unsloth-powered GRPO for training. New LoRA checkpoint loads automatically after each step.

Provide any tool server URL. ART queries the tools, generates tasks, trains the model. Automatically.

Stop paying $0.03/token for tasks a fine-tuned 3B model handles better.

Which task would you fine-tune a small model for first? 👇

♻️ Repost for someone still brute-forcing everything with prompt engineering.
30
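
"Just relative rankings" can be made concrete by turning a judge's ordering of attempts into scores. The post does not give RULER's judge prompt or output format, so the ranking input, the attempt names, and the linear score spacing here are all assumptions for illustration.

```python
from typing import Dict, List

def scores_from_ranking(ranking: List[str]) -> Dict[str, float]:
    """Map attempt ids ordered best-to-worst onto evenly spaced scores in [0, 1]."""
    n = len(ranking)
    if n == 1:
        return {ranking[0]: 1.0}
    return {
        attempt: (n - 1 - position) / (n - 1)
        for position, attempt in enumerate(ranking)
    }

# Hypothetical judge output for 4 attempts at the same task, best first.
judge_ranking = ["attempt_c", "attempt_a", "attempt_d", "attempt_b"]
scores = scores_from_ranking(judge_ranking)

if __name__ == "__main__":
    for attempt, score in scores.items():
        print(f"{attempt}: {score:.2f}")
```

The judge never has to commit to an absolute 0-10 number; it only has to say which attempt did better, and that comparison is the signal the post calls reliable.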