KYL Solutions

Down the Rabbit Hole

Last time we laid out the thesis. This is the stuff that keeps us up at night — research threads that open better questions, not neat answers. Grab a coffee. Fall in.


The Hardware Moat Is Falling

The assumption that AI needs expensive GPUs, data centres, and monthly cloud fees is being challenged from every angle. These are the threads worth pulling.

Video KTG Analysis ~15 min

BitNet: Run 100B AI Models on Your CPU — No GPU Needed

Ternary quantisation (roughly 1.58 bits per weight): model weights represented as {-1, 0, 1} instead of floating point. If this scales, CPUs become viable for AI inference. The entire cost model changes.
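The core trick is simpler than it sounds. Here's a minimal sketch of the "absmean" ternary quantisation the BitNet b1.58 paper describes — scale each weight by the mean absolute value, then round and clip into {-1, 0, 1}. Function names are ours, not the reference implementation:

```python
# Toy sketch of absmean ternary quantisation (names are illustrative).
# Real BitNet applies this per weight matrix during training, not post hoc.

def ternary_quantise(weights):
    """Map float weights to {-1, 0, 1} plus a per-tensor scale."""
    eps = 1e-8
    scale = sum(abs(w) for w in weights) / len(weights)  # mean |W|
    quantised = [max(-1, min(1, round(w / (scale + eps)))) for w in weights]
    return quantised, scale

def dequantise(quantised, scale):
    """Approximate reconstruction: ternary values times the scale."""
    return [q * scale for q in quantised]

w = [0.9, -0.05, -1.2, 0.4]
q, s = ternary_quantise(w)
print(q)  # → [1, 0, -1, 1]
```

The reason this matters for CPUs: with weights in {-1, 0, 1}, matrix multiplication collapses into additions, subtractions, and skips — no floating-point multiplies, and a fraction of the memory traffic.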

Video Fireship ~5 min

Big Tech in Panic Mode… Did DeepSeek R1 Just Pop the AI Bubble?

DeepSeek trained for $5.6M what OpenAI spent $100M+ on, and wiped $600B off NVIDIA's market cap in a day. The Jevons Paradox angle is the real mind-bender: cheaper AI doesn't reduce demand — it explodes it.

Video YouTube ~12 min

Apple's M5 Max Changes the Local AI Story

Real benchmarks: M5 Max running LLMs 2.4-4x faster than M4 Max via MLX. Apple is quietly building unified-memory machines that run 670B-parameter models locally. No other consumer hardware comes close.

Deep Cut AI Bites ~10 min

LLaDA: Large Language Diffusion Models

What if the entire "predict next token" paradigm that powers every LLM is the wrong architecture? LLaDA generates text via diffusion — denoising all tokens simultaneously — and matches LLaMA3 8B. Early, but genuinely paradigm-questioning.


The Architecture That Replaces Chat

The next generation of AI isn't a chatbot. It's small models drafting, big models verifying, and everything running closer to where the work happens.

Research Google Research 10 min read

Looking Back at Speculative Decoding

From the inventors. Small model drafts tokens, big model verifies — 3x faster inference, identical output quality. Google already ships this in Search AI Overviews. The hybrid thesis isn't theoretical.

Article NVIDIA Developer Blog 12 min read

Introduction to Speculative Decoding

NVIDIA's own explanation of why small + big beats just big. The counterintuitive insight: GPUs sit idle 98% of the time waiting for memory. Running two models is faster than running one.
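The draft-and-verify loop both pieces describe can be sketched with toy "models" — plain Python functions standing in for a cheap drafter and an expensive verifier. Everything here is illustrative; real systems accept or reject drafts by comparing token probabilities, not string equality:

```python
# Toy speculative decoding loop: the draft model proposes k tokens,
# the target model checks them in one pass, and the first mismatch
# is replaced by the target's own token. Output is always identical
# to what the target alone would produce — that is the whole point.

def draft_model(prefix, k):
    """Cheap model: fast but wrong at every third position."""
    return [f"tok{len(prefix) + i}" if (len(prefix) + i) % 3 else "guess"
            for i in range(k)]

def target_model(prefix, proposed):
    """Expensive model: accepts correct drafts, corrects the first miss."""
    accepted = []
    for i, tok in enumerate(proposed):
        truth = f"tok{len(prefix) + i}"  # the target's own prediction
        if tok == truth:
            accepted.append(tok)
        else:
            accepted.append(truth)  # replace the wrong draft and stop
            break
    return accepted

def generate(n_tokens, k=4):
    out = []
    while len(out) < n_tokens:
        out.extend(target_model(out, draft_model(out, k)))
    return out[:n_tokens]

print(generate(6))  # → ['tok0', 'tok1', 'tok2', 'tok3', 'tok4', 'tok5']
```

The speed-up comes from the target verifying several drafted tokens per pass instead of generating one token per pass — which is exactly the memory-bandwidth gap NVIDIA's post points at.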

Video NetworkChuck ~20 min

Host ALL Your AI Locally

The practical version: setting up a local AI server with open-source models, running inference on your own hardware. No API keys, no monthly fees, no data leaving your network. This is what "ownable AI" looks like in practice.


The Bigger Picture

Zoom out. If the hardware moat falls, the training moat falls, and the inference moat falls — what's left to charge a licence fee for?

Research arXiv Academic paper

Native LLM and MLLM Inference at Scale on Apple Silicon

Academic paper showing MLX on Apple Silicon isn't a toy. M-series Macs running 70B+ parameter models with serious throughput. The hardware thesis behind the Apple story — written by researchers, not marketing.

Deep Cut Hugging Face / Microsoft Model card

BitNet b1.58 2B4T — First Natively Trained 1-Bit LLM

Where BitNet stops being a research paper and starts being a deployable model. MIT licensed. 4 trillion training tokens. 2 billion parameters in ternary weights. Download it, run it, own it.

Article Stratechery (Ben Thompson) 15 min read

The Benefits of Bubbles

Ben Thompson's thesis: AI bubble spending on physical infrastructure — power, fabs, data centres — has lasting value even if the bubble pops. The value is shifting from software back to physical things. If AI commoditises intelligence, the bottlenecks become copper and cooling systems.

"The assumption that inference equals expensive GPUs, equals cloud, equals monthly licence fees — is being dismantled from every direction at once. We don't know exactly what replaces it. But we know enough to start building."

— KYL SOLUTIONS