Running large language models locally stopped being a hobbyist experiment somewhere in late 2025. With open-weight models like Qwen3.5, Llama 4, and gpt-oss reaching production quality, and Apple Silicon’s M5 generation bringing GPU Neural Accelerators to consumer hardware, local inference is now a legitimate deployment strategy for teams that care about cost, latency, and data privacy.
Here’s what a complete local LLM workflow looks like in mid-2026 — from discovering and downloading models, through running them efficiently on Apple Silicon, to compressing token usage when you need to route to cloud providers. You’ll get concrete commands, version numbers, and cost comparisons.
The Model Management Picture in 2026
The tooling around local models has matured over the past eighteen months. Where early 2025 meant juggling Python scripts and manual checkpoint downloads, the current landscape offers several polished options:
- Ollama remains the most popular local inference engine, now at version 0.19.
- LM Studio provides a GUI-focused alternative for browsing, downloading, and chatting with models.
- Ollama’s own desktop app (launched July 2025) brought a native macOS interface — we’ll use the CLI in this guide for scriptable workflows.
- Hugging Face remains the primary discovery hub for open-weight models, with increasingly refined search and comparison tools.
For this tutorial, we’ll focus on Ollama as the runtime and Hugging Face as the source — that combination covers the widest range of use cases and integrates cleanly with modern coding agents.
Setting Up Ollama for Local Inference
Installation
curl -fsSL https://ollama.com/install.sh | sh
This installs both the CLI and the background service. Once it’s running, you can pull models directly:
ollama pull qwen3.5:35b-a3b-coding-nvfp4
What Changed in 0.19
The MLX backend delivers measurable improvements. On an M5 Ultra running Qwen3.5-35B-A3B, Ollama 0.18 managed 1,154 tokens/s prefill and 58 tokens/s decode. Version 0.19 pushes that to 1,810 tokens/s prefill and 112 tokens/s decode — roughly 1.6x and 1.9x improvements respectively. In practical terms, a 500-token code review generates in about 4.5 seconds, and a 2,000-token refactoring suggestion finishes in under 18 seconds.
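The ratios and timings above follow from simple division, which a few lines of Python can confirm. This is only a sanity check on the quoted figures; it ignores prefill time and model load time, so real wall-clock numbers will be somewhat higher.

```python
# Throughput figures quoted in the text (tokens per second, M5 Ultra).
PREFILL_018, DECODE_018 = 1154, 58
PREFILL_019, DECODE_019 = 1810, 112


def speedup(new: float, old: float) -> float:
    """Ratio of the new throughput to the old one."""
    return new / old


def decode_seconds(num_tokens: int, tokens_per_second: float) -> float:
    """Time to generate num_tokens at a steady decode rate (prefill ignored)."""
    return num_tokens / tokens_per_second


print(f"prefill speedup: {speedup(PREFILL_019, PREFILL_018):.2f}x")
print(f"decode speedup:  {speedup(DECODE_019, DECODE_018):.2f}x")
print(f"500 tokens:   {decode_seconds(500, DECODE_019):.1f} s")
print(f"2,000 tokens: {decode_seconds(2000, DECODE_019):.1f} s")
```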
Three other features in 0.19 matter for production workflows:
- NVFP4 quantization: NVIDIA’s 4-bit floating-point format, which delivers better quality than traditional Q4_K_M at a comparable bit width.
- Improved caching: intelligent checkpoint management and smarter eviction reduce memory pressure during long sessions.
- ollama launch: a command added in January 2026 that auto-configures Claude Code, OpenCode, or Codex to use either local or cloud models. This bridges the gap between local inference and agentic coding workflows.
Quick Test
ollama run qwen3.5:35b-a3b-coding-nvfp4 "Explain quantization in three sentences."
MLX: The Engine Behind the Engine
MLX is Apple’s open-source machine learning framework for Apple Silicon. Its key design decision is unified memory: on Apple Silicon, the CPU and GPU share the same memory pool, so there’s no data transfer bottleneck between processors. This is why MLX models can load larger contexts on the same hardware than frameworks that require separate CPU/GPU memory copies.
MLX offers APIs in Python (NumPy-compatible), C++, C, and Swift. The higher-level mlx.nn and mlx.optimizers modules provide PyTorch-like APIs for building and fine-tuning models. As of version 0.31.x, MLX also supports a CUDA backend on Linux, though its primary home remains macOS.
For most users, you won’t interact with MLX directly — Ollama wraps it. But if you’re fine-tuning or running model variants that aren’t in Ollama’s library, mlx-lm is the tool to reach for:
pip install mlx-lm
python -m mlx_lm.generate --model mlx-community/Qwen3.5-35B-A3B-4bit
But local inference isn’t the whole story. Some workloads (complex reasoning chains, large-scale refactoring, agentic loops) still benefit from cloud models, and managing that hybrid setup requires a different kind of tool.
Edgee: Token Compression for Hybrid Workflows
Edgee is a transparent proxy that sits between your coding agent and the cloud provider and compresses requests in two layers:
- Layer 1 (Input): trims tool results and conversation history. For instance, if a tool call returns 20 KB of JSON but only the status field matters, Edgee strips the rest before sending it to the provider.
- Layer 2 (Output): enforces brevity constraints on model responses.
Edgee claims up to 50% cost reduction on provider bills through these compression techniques. It integrates with Claude Code, Codex, OpenCode, and Cursor — all without code changes, because it runs as a transparent proxy.
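To make the Layer 1 idea concrete, here is a minimal sketch of input trimming. It is not Edgee’s implementation, which the source doesn’t describe; trim_tool_result and its keep_fields parameter are hypothetical names, and the sketch assumes the tool result is a JSON object whose useful fields sit at the top level.

```python
import json


def trim_tool_result(raw_json: str, keep_fields: set[str]) -> str:
    """Keep only the top-level fields named in keep_fields; drop the rest.

    Hypothetical helper for illustration, not part of any Edgee API.
    """
    data = json.loads(raw_json)
    trimmed = {key: value for key, value in data.items() if key in keep_fields}
    # Compact separators avoid spending tokens on whitespace.
    return json.dumps(trimmed, separators=(",", ":"))


# A bulky tool result where only "status" matters to the model.
raw = json.dumps({"status": "ok", "logs": ["line"] * 500, "debug": {"trace": "x" * 2000}})
small = trim_tool_result(raw, {"status"})
print(len(raw), "->", len(small), "characters")
```

The saving depends entirely on how much of each payload the model actually needs, so treat any fixed percentage as workload-specific.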
curl -fsSL https://edgee.ai/install.sh | bash
Edgee supports bring-your-own-key and fallback model routing, with SOC 2 and GDPR compliance for teams that need it. A practical setup: run Ollama locally for quick iterations, and route through Edgee to a cloud provider for heavy lifting.
Step-by-Step: From Hugging Face to Local Deployment
1. Discover a Model
Browse Hugging Face for open-weight models. Filter by license (Llama 4 Community License, Apache 2.0, MIT) and look for MLX-converted variants — they’re tagged under the mlx-community namespace.
Which model to pick? Qwen3.5 variants excel at code generation and structured reasoning. Llama 4 models are stronger for chat and creative writing. For teams doing both, serve multiple models and route by task type using Ollama’s model management.
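Routing by task type can be as simple as a lookup table in front of whatever client you use. The sketch below is one way to do it, not a built-in Ollama feature: pick_model and ROUTES are hypothetical names, and the "llama4" tag is a placeholder (run ollama list to see the tags actually installed on your machine).

```python
CODING_MODEL = "qwen3.5:35b-a3b-coding-nvfp4"  # tag used throughout this guide
CHAT_MODEL = "llama4"  # placeholder tag (assumption); substitute your installed one

# Task types are illustrative labels, not anything Ollama defines.
ROUTES = {
    "code": CODING_MODEL,
    "reasoning": CODING_MODEL,
    "chat": CHAT_MODEL,
    "creative": CHAT_MODEL,
}


def pick_model(task_type: str, default: str = CODING_MODEL) -> str:
    """Return the model tag for a task type, falling back to a default."""
    return ROUTES.get(task_type, default)


print(pick_model("code"))
print(pick_model("creative"))
```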
2. Pull It Locally
ollama pull qwen3.5:35b-a3b-coding-nvfp4
This downloads the model and optimizes it for your hardware on first load.
3. Test the Model
ollama run qwen3.5:35b-a3b-coding-nvfp4
Try a task relevant to your use case, such as code generation, summarization, or structured data extraction.
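For repeatable tests, you can send the same prompt through Ollama’s local HTTP API instead of the interactive CLI. The sketch below uses only the standard library and assumes the service is listening on its default address, localhost:11434; build_request and generate are helper names chosen for this example.

```python
import json
import urllib.request

OLLAMA_URL = "http://localhost:11434/api/generate"  # default local address


def build_request(model: str, prompt: str) -> urllib.request.Request:
    """Build a non-streaming generate request for the local Ollama service."""
    payload = {"model": model, "prompt": prompt, "stream": False}
    return urllib.request.Request(
        OLLAMA_URL,
        data=json.dumps(payload).encode("utf-8"),
        headers={"Content-Type": "application/json"},
        method="POST",
    )


def generate(model: str, prompt: str, timeout: float = 120.0) -> str:
    """Send the prompt and return the generated text (requires a running service)."""
    with urllib.request.urlopen(build_request(model, prompt), timeout=timeout) as resp:
        return json.loads(resp.read())["response"]
```

With the service running and the model pulled, generate("qwen3.5:35b-a3b-coding-nvfp4", "Summarize this diff: ...") returns the reply as a string, which makes it easy to script a small suite of prompts from your own workload.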
4. Route Through Edgee (Optional)
As covered in the Edgee section above, run:
edgee launch claude
This enables two-layer compression on requests heading to the cloud, with automatic fallback to your local Ollama instance if the cloud model is unavailable.
5. Integrate with Coding Agents
ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4
This configures the agent to use your local model for most tasks, with optional cloud failover.
Local vs Cloud: The Cost Reality
Pricing in mid-2026 makes the local argument straightforward:
| Scenario | Cost per 1M tokens (input) | Dedicated hardware cost |
|---|---|---|
| GPT-5.4 via API | $2.50 | $0 |
| Claude Sonnet 4.6 via API | $3.00 | $0 |
| Qwen3.5-35B-A3B (local, M5 Ultra) | $0 (electricity only) | $0 (already owned) |
| Dedicated cloud inference (A100 80GB)* | ~$1.00 | ~$1.50/hr |
* Cloud instance pricing varies by provider and region.
If you already own an Apple Silicon machine, local inference is free beyond electricity costs. For teams scaling to millions of tokens per month, the break-even point against cloud APIs arrives quickly — typically within the first 1-2 months of active use, depending on token volume.
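If you’re buying hardware rather than reusing it, the break-even point depends heavily on token volume. The sketch below shows the arithmetic under stated assumptions: the $4,000 hardware price is a made-up figure for illustration, and the $3.00 rate is the input-token price from the table, so output-token costs are not modeled.

```python
def monthly_api_cost(tokens_per_month: float, price_per_million: float) -> float:
    """API spend per month at a flat per-million-token price."""
    return tokens_per_month / 1_000_000 * price_per_million


def break_even_months(
    hardware_cost: float,
    tokens_per_month: float,
    price_per_million: float,
    monthly_electricity: float = 0.0,
) -> float:
    """Months until avoided API spend covers the hardware purchase."""
    savings = monthly_api_cost(tokens_per_month, price_per_million) - monthly_electricity
    if savings <= 0:
        return float("inf")  # local never pays for itself at this volume
    return hardware_cost / savings


HARDWARE_COST = 4000.0  # hypothetical purchase price, not a quoted figure
for tokens in (100e6, 1e9):
    months = break_even_months(HARDWARE_COST, tokens, price_per_million=3.00)
    print(f"{tokens / 1e6:,.0f}M tokens/month -> {months:.1f} months")
```

Plug in your own volume and hardware price before leaning on any particular break-even figure.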
Key insight: Edgee’s compression becomes relevant even in local-first setups — when you do route to the cloud, every token should count.
Privacy and Team Considerations
Local inference eliminates the data exposure risks of sending code, internal documents, or customer data to third-party APIs. For teams subject to SOC 2, GDPR, or HIPAA compliance, this is often the deciding factor.
Edgee’s Gateway supports bring-your-own-key and fallback model routing — if your local instance goes down, it routes to a cloud provider without interrupting the workflow. The Gateway also logs token usage and latency per team member, giving you observability without sacrificing privacy.
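The fallback-plus-logging pattern described here can be sketched in a few lines. This is a generic illustration rather than Edgee’s Gateway code; call_with_fallback, the stub backends, and the log record fields are all hypothetical.

```python
import time
from typing import Callable

Backend = Callable[[str], str]


def call_with_fallback(primary: Backend, fallback: Backend, prompt: str, log: list) -> str:
    """Try the primary backend, then the fallback, recording the outcome of each."""
    for name, backend in (("primary", primary), ("fallback", fallback)):
        start = time.perf_counter()
        try:
            reply = backend(prompt)
        except Exception as exc:  # any failure triggers the next backend
            log.append({"backend": name, "ok": False, "error": str(exc)})
            continue
        log.append({"backend": name, "ok": True, "latency_s": time.perf_counter() - start})
        return reply
    raise RuntimeError("all backends failed")


# Stub backends standing in for a local instance and a cloud provider.
def local_down(prompt: str) -> str:
    raise ConnectionError("local instance unreachable")


def cloud_stub(prompt: str) -> str:
    return "cloud reply to: " + prompt


usage_log: list = []
print(call_with_fallback(local_down, cloud_stub, "review this function", usage_log))
print(usage_log)
```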
Hardware Requirements
Here’s what the Ollama + MLX stack demands in practice:
- Minimum: Apple M2 Pro with 16 GB unified memory — runs 7B-8B models comfortably.
- Recommended: Apple M4 Pro / M5 Pro with 24-36 GB — handles 14B-35B models with solid throughput.
- Ideal: M5 Ultra with 64+ GB — runs 35B-72B models and supports larger context windows.
The key metric is unified memory, not just compute. MLX’s architecture means your model size is constrained by available memory. A 35B parameter model in 4-bit quantization uses roughly 18-20 GB, so 24 GB is the practical minimum for models in that range.
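The 18-20 GB figure comes from straightforward arithmetic: parameters times bits per weight, divided by eight, plus some headroom. The sketch below makes that explicit. The 2 GB overhead default is an assumption standing in for KV cache and runtime buffers, and it grows with context length.

```python
def weight_memory_gb(params_billion: float, bits_per_weight: float) -> float:
    """Memory for the weights alone, in decimal GB."""
    return params_billion * bits_per_weight / 8


def fits(
    params_billion: float,
    bits_per_weight: float,
    unified_memory_gb: float,
    overhead_gb: float = 2.0,  # assumed headroom for KV cache and runtime
) -> bool:
    """Rough check: do the weights plus overhead fit in unified memory?"""
    return weight_memory_gb(params_billion, bits_per_weight) + overhead_gb <= unified_memory_gb


print(weight_memory_gb(35, 4))   # weights only for a 35B model at 4-bit
print(fits(35, 4, 24))           # 24 GB machine
print(fits(35, 4, 16))           # 16 GB machine
```

The operating system and other apps also claim part of the unified pool, so leave more margin than this check suggests.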
Local inference isn’t a silver bullet. Frontier models like GPT-5.4 and Claude Sonnet 4.6 still win on reasoning depth and breadth of knowledge. Setting up and maintaining a local stack also carries overhead that zero-ops API calls don’t — and an M5 Ultra with 64 GB isn’t cheap. The trade-off is cost and privacy versus capability: choose based on your workload, not ideology.
Local-First Is a Production Strategy
The 2026 toolchain for local LLMs has reached the maturity of cloud-based alternatives. Ollama handles deployment, MLX handles performance, and Edgee handles the cost-aware bridge to the cloud when you need it.
For teams adopting coding agents like Claude Code or OpenCode, the combination of ollama launch, NVFP4 quantization, and Edgee’s token compression creates a workflow that’s faster, cheaper, and more private than cloud-only alternatives. The local-first approach isn’t a compromise anymore — it’s a viable production strategy with clear advantages.
Further Reading
- Ollama — Official Website — Download the engine, browse the model library, and explore the CLI reference. The MLX backend announcement details the 0.19 performance benchmarks.
- MLX on GitHub — Apple’s open-source machine learning framework with source code, releases, and community examples for Apple Silicon.
- Edgee — Token Compression — Deep dive into Edgee’s two-layer compression approach, benchmarks, and cost-reduction case studies for coding agent workflows.
- Hugging Face mlx-community Models — Curated collection of MLX-converted open-weight models, ready for local deployment on Apple Silicon.