Last time we laid out the thesis. This is the stuff that keeps us up at night — research threads that open better questions, not neat answers. Grab a coffee. Fall in.
The assumption that AI needs expensive GPUs, data centres, and monthly cloud fees is being challenged from every angle. These are the threads worth pulling.
1-bit quantisation (strictly 1.58-bit, since ternary weights carry log2(3) bits each): model weights represented as {-1, 0, 1} instead of floating point, so most multiplies become adds. If this scales, CPUs become viable for AI inference. The entire cost model changes.
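To make the arithmetic concrete, here's a toy sketch of BitNet-style absmean quantisation in pure Python. The function names and the example weights are ours, not from any library, and this is the idea rather than the paper's actual kernel: scale by the mean absolute weight, round to {-1, 0, 1}, and a dot product no longer needs multiplication.

```python
def ternary_quantize(weights, eps=1e-8):
    # Absmean quantisation sketch: scale by the mean |w|,
    # then round each weight to the nearest of {-1, 0, 1}.
    scale = sum(abs(w) for w in weights) / len(weights) + eps
    q = [max(-1, min(1, round(w / scale))) for w in weights]
    return q, scale

def ternary_dot(q, scale, x):
    # With ternary weights a dot product needs no multiplies:
    # each activation is added, subtracted, or skipped.
    acc = 0.0
    for w, xi in zip(q, x):
        if w == 1:
            acc += xi
        elif w == -1:
            acc -= xi
    return acc * scale
```

For example, `ternary_quantize([0.9, -0.1, -1.1, 1.0])` gives weights `[1, 0, -1, 1]` with scale 0.775, and the dot product with `[1.0, 2.0, 3.0, 4.0]` costs two adds and a subtract. That add-only inner loop is why CPUs get back in the game.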
DeepSeek reportedly trained for $5.6M what OpenAI spent $100M+ on, and the release wiped roughly $600B off NVIDIA's market cap in a single day. The Jevons Paradox angle is the real mind-bender: cheaper AI doesn't reduce demand; it explodes it.
Real benchmarks: M5 Max running LLMs 2.4-4x faster than M4 Max via MLX. Apple is quietly building unified-memory machines that run quantised ~670B-parameter models locally. No other consumer hardware comes close.
What if the entire "predict next token" paradigm that powers every LLM is the wrong architecture? LLaDA generates text via diffusion — denoising all tokens simultaneously — and matches LLaMA3 8B. Early, but genuinely paradigm-questioning.
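A hedged toy of the masked-diffusion idea (a sketch of the concept behind LLaDA, not its actual training or sampling code; `fill_fn` and the commit schedule are our invention): start from an all-mask sequence, predict every position in parallel, and commit only the most confident predictions each step.

```python
MASK = "<mask>"

def diffusion_generate(fill_fn, length=8, steps=4):
    # fill_fn(tokens) returns a (token, confidence) prediction
    # for every position at once -- the denoiser stand-in.
    tokens = [MASK] * length
    for step in range(steps):
        preds = fill_fn(tokens)
        masked = [i for i, t in enumerate(tokens) if t == MASK]
        if not masked:
            break
        # Commit a fraction of the masked slots each step, most
        # confident first; the rest stay masked for the next pass.
        n_keep = max(1, len(masked) // (steps - step))
        for i in sorted(masked, key=lambda j: preds[j][1], reverse=True)[:n_keep]:
            tokens[i] = preds[i][0]
    # Final safety pass: fill anything still masked.
    preds = fill_fn(tokens)
    return [preds[i][0] if t == MASK else t for i, t in enumerate(tokens)]
```

The structural difference from autoregression is visible in the loop: every position is predicted simultaneously at every step, so generation order follows model confidence rather than left-to-right position.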
The next generation of AI isn't a chatbot. It's small models drafting, big models verifying, and everything running closer to where the work happens.
From the inventors. Small model drafts tokens, big model verifies — 3x faster inference, identical output quality. Google already ships this in Search AI Overviews. The hybrid thesis isn't theoretical.
NVIDIA's own explanation of why small + big beats just big. The counterintuitive insight: GPUs sit idle 98% of the time waiting for memory. Running two models is faster than running one.
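The mechanism is simple enough to sketch in a few lines. This is a greedy toy (real systems verify with one batched forward pass and accept sampled tokens via a rejection rule; `target_next` and `draft_next` are stand-ins we made up): the draft proposes k tokens, the target checks them, and the longest agreed prefix plus the target's own correction gets committed.

```python
def speculative_decode(target_next, draft_next, prompt, k=4, max_new=8):
    # target_next(seq) / draft_next(seq) return each model's next
    # token for a sequence. In a real system the verification loop
    # below is a single batched forward pass of the big model.
    seq = list(prompt)
    while len(seq) - len(prompt) < max_new:
        # 1. The cheap draft model speculates k tokens ahead.
        spec = []
        for _ in range(k):
            spec.append(draft_next(seq + spec))
        # 2. The big model checks each speculated position; on the
        #    first mismatch, its own token replaces the draft's.
        accepted = 0
        for i in range(k):
            t = target_next(seq + spec[:i])
            if t == spec[i]:
                accepted += 1
            else:
                spec[accepted] = t
                accepted += 1
                break
        seq.extend(spec[:accepted])
    # Output is identical to decoding with the big model alone.
    return seq[:len(prompt) + max_new]

# Toy stand-ins: the "big" model counts upward mod 10; the draft
# agrees except after token 4, where it guesses wrong.
def target_next(seq):
    return (seq[-1] + 1) % 10

def draft_next(seq):
    return 0 if seq[-1] == 4 else (seq[-1] + 1) % 10
```

Running `speculative_decode(target_next, draft_next, [0])` produces exactly the tokens the big model would emit alone, but the big model only had to intervene where the draft went wrong. That's the whole trade: burn cheap draft compute to fill the memory-stall gaps in the expensive model's schedule.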
The practical version: setting up a local AI server with open-source models, running inference on your own hardware. No API keys, no monthly fees, no data leaving your network. This is what "ownable AI" looks like in practice.
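As one concrete shape of that setup, here's a minimal stdlib-only client for an Ollama-style local server. The endpoint and fields follow Ollama's `/api/generate` API; the model tag is illustrative (use whatever you've pulled), and the payload builder is split out so it can be sanity-checked without a running server.

```python
import json
import urllib.request

def build_payload(prompt, model="llama3.2"):
    # One-shot (non-streaming) generation request. The model tag
    # is illustrative: any model pulled to the local server works.
    return {"model": model, "prompt": prompt, "stream": False}

def ask_local(prompt, model="llama3.2", host="http://localhost:11434"):
    # POST to the local server's /api/generate endpoint. Nothing
    # leaves your network: host is your own machine.
    data = json.dumps(build_payload(prompt, model)).encode()
    req = urllib.request.Request(
        f"{host}/api/generate", data=data,
        headers={"Content-Type": "application/json"},
    )
    with urllib.request.urlopen(req) as resp:
        return json.loads(resp.read())["response"]
```

No API key, no per-token billing, no third-party endpoint in the request path: `ask_local("Summarise this log")` round-trips entirely on localhost.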
Zoom out. If the hardware moat falls, the training moat falls, and the inference moat falls — what's left to charge a licence fee for?
Academic paper showing MLX on Apple Silicon isn't a toy. M-series Macs running 70B+ parameter models with serious throughput. The hardware thesis behind the Apple story — written by researchers, not marketing.
Where BitNet stops being a research paper and starts being a deployable model. MIT licensed. 4 trillion training tokens. 2 billion parameters in ternary weights. Download it, run it, own it.
Ben Thompson's thesis: AI bubble spending on physical infrastructure — power, fabs, data centres — has lasting value even if the bubble pops. The value is shifting from software back to physical things. If AI commoditises intelligence, the bottlenecks become copper and cooling systems.
"The assumption that inference equals expensive GPUs, equals cloud, equals monthly licence fees — is being dismantled from every direction at once. We don't know exactly what replaces it. But we know enough to start building."