Hardware Review

The ThinkPad that changed my mind about local LLMs

A skeptical look at a Strix Halo ThinkPad with unified memory. I bought it with low expectations and have been overwhelmed by how much I like it.

Ramazan Yavuz

For years I operated on a simple rule. If you want to run a real local language model, you need a discrete GPU with a lot of VRAM, and the realistic floor was a desktop with a 4090 or its equivalent. A laptop, even a beefy one, was a compromise. The 8 to 12 GB of VRAM in a top-tier mobile dGPU got you 7B and 8B models at low quantization, fine, but the moment you wanted a 27B or a 32B you were back at the desktop. So when I started looking at AMD's Strix Halo platform (the Ryzen AI Max+ 395 with an integrated Radeon 8060S iGPU and 64-128 GB of unified LPDDR5X memory) I was skeptical. The marketing kept emphasizing the unified memory, which sounded like a polite way of saying "no real GPU." Same-spec laptops with discrete GPUs sell for double or triple the price, and that gap, in my experience, usually means you are getting what you pay for.

I bought a ThinkPad with the Ryzen AI Max+ and a generously provisioned unified memory pool anyway, mostly out of curiosity. My expectations were low. I assumed it would be a fine workstation laptop, that the LLM benchmarks would be a side dish, and that I would still keep my desktop for serious inference. I have been wrong about technology before, but I have rarely been this wrong this fast.


The frame I had been carrying for years (VRAM is the bottleneck, GPUs are the path, unified memory is a polite consolation) turned out to be mostly out of date. On Strix Halo, the iGPU and the CPU share the same physical memory pool. There is no copy across a PCIe bus, no VRAM ceiling distinct from system RAM, no juggling between layers offloaded to GPU and layers running on CPU. You point a model at the unified pool and it runs.

In practice that means I can load a Gemma-3-27B at Q4 on this laptop and get response latency that is, conservatively, in the same ballpark as what I used to get on a desktop dGPU running a Llama-3.1-8B at higher quantization. A 32B at Q4 is usable. A 70B at low quant is technically loadable into the larger memory configurations, but slow enough that I would not use it interactively. For everything below that, this is a real tool. I keep checking my numbers because they feel wrong, and they are not wrong.
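To make "you point a model at the unified pool and it runs" concrete, here is roughly what that looks like through the llama-cpp-python bindings. This is a minimal sketch rather than my exact setup, and the GGUF filename is illustrative; the point is that asking for every layer to be offloaded just works, because there is no separate VRAM budget to manage.

```python
# Minimal sketch using the llama-cpp-python bindings (not my exact setup).
# The GGUF filename is illustrative; n_gpu_layers=-1 offloads every layer
# to the GPU backend, which on Strix Halo means the same unified pool.
from llama_cpp import Llama

llm = Llama(
    model_path="gemma-3-27b-it-Q4_K_M.gguf",  # illustrative path
    n_gpu_layers=-1,  # offload everything; no VRAM ceiling to juggle
    n_ctx=8192,
)

out = llm(
    "Summarize the tradeoffs of unified memory for local inference.",
    max_tokens=256,
)
print(out["choices"][0]["text"])
```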


This is the part I think is most underappreciated. A workstation laptop with a comparable mobile dGPU (something with 16 GB of VRAM or more) starts at roughly two to three times what I paid for this. And those discrete-GPU laptops, for local LLM purposes, are worse: their VRAM ceiling is lower, their CPU and GPU memory are split, and their thermal envelope under sustained load is harder to manage. I went in expecting "decent laptop, probably worth the money." I came out with "this is the cheapest local-LLM workstation I own, and it is also a laptop I can take on a plane." That is not a sentence I expected to write.


I do not want to oversell this. The Strix Halo iGPU is not faster than a 4090 at the things a 4090 is good at. If you want to fine-tune a 13B, or run any training at meaningful scale, this is not the machine. And if you want fast inference on a small model with tight latency targets, a discrete GPU still delivers more tokens per second per dollar.

The thing Strix Halo does that nothing else at this price tier does is hold a really large model in memory and run it at acceptable speed. Where a discrete-GPU laptop runs out of VRAM, this one keeps going. Where a small unified-memory laptop runs out of total memory, this one keeps going. The 64-128 GB of pooled memory is the actual product, and that is the part my mental model had not caught up to.
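A back-of-the-envelope estimate shows why the pool size is the product. This is my own rule of thumb, not a precise formula: roughly 4.5 bits per weight for a Q4-class quant, plus a few gigabytes for KV cache and runtime buffers.

```python
# Rough rule of thumb, not a precise formula: ~4.5 bits per weight for a
# Q4-class quant, plus a flat allowance for KV cache and runtime buffers.
def rough_footprint_gb(params_billion: float, bits_per_weight: float = 4.5,
                       overhead_gb: float = 4.0) -> float:
    weights_gb = params_billion * bits_per_weight / 8
    return weights_gb + overhead_gb

for name, size in [("8B", 8), ("27B", 27), ("32B", 32), ("70B", 70)]:
    print(f"{name}: ~{rough_footprint_gb(size):.0f} GB")
# The sub-35B models sit comfortably inside a 64 GB pool; the 70B is why
# the larger memory configurations exist.
```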


Most of the time it is my normal development laptop. Editor, browser, terminal, the usual. The fans are essentially silent at idle, and the chassis is the standard ThinkPad business build (matte, sturdy, repairable, decent keyboard). When I need a model, I open hydra-llm, which is the wrapper I built around llama.cpp specifically because juggling models on this laptop was a regular activity. The halo tier in its catalog is named after this exact platform. I run a Gemma-3-27B for general work, a Qwen-2.5-32B when I need stronger reasoning, and a Llama-3.1-8B when I want lower latency. They all fit in memory at the same time, technically, although I rarely run more than one at once.
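For a sense of what the wrapper saves me, the juggling boils down to something like this. To be clear, this is a sketch and not hydra-llm's actual interface; the tier names and file paths are placeholders.

```python
# A sketch of the model juggling the wrapper automates for me. This is not
# hydra-llm's actual interface; the tier names and paths are placeholders.
from llama_cpp import Llama

CATALOG = {
    "general":   "models/gemma-3-27b-Q4_K_M.gguf",   # everyday work
    "reasoning": "models/qwen2.5-32b-Q4_K_M.gguf",   # harder problems
    "fast":      "models/llama-3.1-8b-Q4_K_M.gguf",  # low latency
}

def load(tier: str) -> Llama:
    # Everything fits in the unified pool, but one model at a time is plenty.
    return Llama(model_path=CATALOG[tier], n_gpu_layers=-1, n_ctx=8192)
```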

The fans spin up under sustained load, but the chassis stays manageable. The thermal envelope is meaningfully better than I expected; I have run multi-hour inference sessions without throttling, which I cannot say for any laptop dGPU I have used.


Beyond the silicon, this is also a ThinkPad. That does real work for me independent of the LLM story. The keyboard is a TrackPoint keyboard, which I missed every time I used a non-ThinkPad laptop in the past decade. The chassis opens with a screwdriver and the storage is replaceable. thinkpad_acpi is one of the best-supported drivers in the kernel, which means inhibit-charge works flawlessly on this laptop and the battery ages gracefully because I park it at 60% all day. Fingerprint reader works. Webcam shutter is mechanical. The little things add up.
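The 60% parking, for what it's worth, is just the charge-threshold files thinkpad_acpi exposes in sysfs. A minimal sketch of how to set it; the battery name and exact thresholds may differ on your unit, and writing these files needs root.

```python
# Minimal sketch: park the battery around 60% via the charge-threshold
# files that thinkpad_acpi exposes in sysfs. Needs root; BAT0 may differ.
from pathlib import Path

BAT = Path("/sys/class/power_supply/BAT0")

def park_battery(start: int = 55, stop: int = 60) -> None:
    # Charging starts below `start`% and stops at `stop`%.
    (BAT / "charge_control_start_threshold").write_text(f"{start}\n")
    (BAT / "charge_control_end_threshold").write_text(f"{stop}\n")

if __name__ == "__main__":
    park_battery()
```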

It is not a particularly thin laptop, and it is not pretending to be. The weight and thickness are the budget for the cooling, the battery, the unified memory, and the I/O. I am fine with that tradeoff. If I wanted a thin laptop I would have bought a thin laptop.


Linux has been stable. Standard kernel, standard mesa, standard userspace. The ROCm story for AMD iGPUs is workable. llama.cpp runs with both the Vulkan and HIP backends; I use Vulkan most of the time because it is simpler and the performance delta is not enough for me to care. Everything ThinkPad-specific (battery management, function keys, sensors) works out of the box on a recent Ubuntu or Fedora. I have not had a single hardware regression that required searching forums for an obscure kernel parameter, which is rare for a laptop this new and worth saying out loud.


I bought this with low expectations and I have been overwhelmed by how much I like it. The performance-per-dollar story for local language models is so strongly in this laptop's favor right now that I am going to keep recommending it to anyone whose primary local-LLM bottleneck is VRAM, which is most people.

The advice I want to give my past self is simple. The thing you knew about needing a discrete GPU was true for a long time. It stopped being true quietly, while you were not paying attention. Unified memory at this scale changes the answer. The same-spec laptops with discrete GPUs that go for double the price are not actually the same product anymore.

I am writing this article on this laptop. Earlier today it was running a 27B model in the background while I was browsing. The fan never spun up. Six months ago I would have called that science fiction. If your local-LLM rule of thumb is "you need a real GPU," it is worth checking whether that rule is still load-bearing. For me, it no longer is.