Guides

Your Local LLM Workflow in 2026: From Model Management to Production

A step-by-step tutorial on setting up a modern local LLM workflow in mid-2026, covering Ollama, MLX, and Edgee with cost comparisons vs cloud.

Running large language models locally stopped being a hobbyist experiment somewhere in late 2025. With open-weight models like Qwen3.5, Llama 4, and gpt-oss reaching production quality, and Apple Silicon’s M5 generation bringing GPU Neural Accelerators to consumer hardware, local inference is now a legitimate deployment strategy for teams that care about cost, latency, and data privacy.

Here’s what a complete local LLM workflow looks like in mid-2026 — from discovering and downloading models, through running them efficiently on Apple Silicon, to compressing token usage when you need to route to cloud providers. You’ll get concrete commands, version numbers, and cost comparisons.

The Model Management Picture in 2026

The tooling around local models has matured over the past eighteen months. Where early 2025 meant juggling Python scripts and manual checkpoint downloads, the current landscape offers several polished options:

For this tutorial, we’ll focus on Ollama as the runtime and Hugging Face as the source — that combination covers the widest range of use cases and integrates cleanly with modern coding agents.

Setting Up Ollama for Local Inference

Ollama 0.19, released March 30, 2026, is a significant milestone. The engine now runs on MLX under the hood on Apple Silicon, which brings Apple’s unified memory architecture into the inference pipeline. No extra config needed.

Installation

curl -fsSL https://ollama.com/install.sh | sh

This installs both the CLI and the background service. Once it’s running, you can pull models directly:

ollama pull qwen3.5:35b-a3b-coding-nvfp4

What Changed in 0.19

The MLX backend delivers measurable improvements. On an M5 Ultra running Qwen3.5-35B-A3B, Ollama 0.18 managed 1,154 tokens/s prefill and 58 tokens/s decode. Version 0.19 pushes that to 1,810 tokens/s prefill and 112 tokens/s decode — roughly 1.6x and 1.9x improvements respectively. In practical terms, a 500-token code review generates in about 4.5 seconds, and a 2,000-token refactoring suggestion finishes in under 18 seconds.

Three other features in 0.19 matter for production workflows:

  • NVFP4 quantization — NVIDIA’s 4-bit floating-point format, which delivers better quality than traditional Q4_K_M at the same bit width.
  • Improved caching — intelligent checkpoint management and smarter eviction reduce memory pressure during long sessions.
  • ollama launch — a command added in January 2026 that auto-configures Claude Code, OpenCode, or Codex to use either local or cloud models. This bridges the gap between local inference and agentic coding workflows.

Quick Test

ollama run qwen3.5:35b-a3b-coding-nvfp4 "Explain quantization in three sentences."

MLX: The Engine Behind the Engine

MLX is Apple’s machine learning framework, developed by Apple ML Research and now at version 0.31.2 (released April 22, 2026) with over 26,400 GitHub stars. Understanding MLX matters because whether you use Ollama or run models directly, it’s likely involved.

The framework’s key design decision is unified memory — on Apple Silicon, the CPU and GPU share the same memory pool, so there’s no data transfer bottleneck between processors. This is why MLX models can load larger contexts on the same hardware than frameworks that require separate CPU/GPU memory copies.

MLX offers APIs in Python (NumPy-compatible), C++, C, and Swift. The higher-level mlx.nn and mlx.optimizers modules provide PyTorch-like APIs for building and fine-tuning models. As of version 0.31.x, MLX also supports a CUDA backend on Linux, though its primary home remains macOS.

For most users, you won’t interact with MLX directly — Ollama wraps it. But if you’re fine-tuning or running model variants that aren’t in Ollama’s library, mlx-lm is the tool to reach for:

pip install mlx-lm
python -m mlx_lm.generate --model mlx-community/Qwen3.5-35B-A3B-4bit

But local inference isn’t the whole story. Some workloads — complex reasoning chains, large-scale refactoring, agentic loops — still benefit from cloud models, and managing that hybrid setup requires a different kind of tool.

Edgee: Token Compression for Hybrid Workflows

Edgee is an AI agent gateway that sits between your local environment and LLM providers. Its main value proposition is two-layer token compression:

  • Layer 1 (Input) — trims tool results and conversation history. For instance, if a tool call returns 20 KB of JSON but only the status field matters, Edgee strips the rest before sending it to the provider.
  • Layer 2 (Output) — enforces brevity constraints on model responses.

Edgee claims up to 50% cost reduction on provider bills through these compression techniques. It integrates with Claude Code, Codex, OpenCode, and Cursor — all without code changes, because it runs as a transparent proxy.

curl -fsSL https://edgee.ai/install.sh | bash

Edgee supports bring-your-own-key and fallback model routing, with SOC 2 and GDPR compliance for teams that need it. A practical setup: run Ollama locally for quick iterations, route through Edgee to a cloud provider for heavy lifting.

Step-by-Step: From Hugging Face to Local Deployment

1. Discover a Model

Browse Hugging Face for open-weight models. Filter by license (Llama 4 Community License, Apache 2.0, MIT) and look for MLX-converted variants — they’re tagged under the mlx-community namespace.

Which model to pick? Qwen3.5 variants excel at code generation and structured reasoning. Llama 4 models are stronger for chat and creative writing. For teams doing both, serve multiple models and route by task type using Ollama’s model management.

2. Pull It Locally

ollama pull qwen3.5:35b-a3b-coding-nvfp4

This downloads the model and optimizes it for your hardware on first load.

3. Test the Model

ollama run qwen3.5:35b-a3b-coding-nvfp4

Try a task relevant to your use case — code generation, summarization, or structured data extraction.

4. Route Through Edgee (Optional)

As covered in the Edgee section above, run:

edgee launch claude

This enables two-layer compression on requests heading to the cloud, with automatic fallback to your local Ollama instance if the cloud model is unavailable.

5. Integrate with Coding Agents

ollama launch claude --model qwen3.5:35b-a3b-coding-nvfp4

This configures the agent to use your local model for most tasks, with optional cloud failover.

Local vs Cloud: The Cost Reality

Pricing in mid-2026 makes the local argument straightforward:

Scenario

Cost per 1M tokens (input)

Dedicated hardware cost

GPT-5.4 via API

$2.50

$0

Claude Sonnet 4.6 via API

$3.00

$0

Qwen3.5-35B-A3B (local, M5 Ultra)

$0 (electricity only)

$0 (already owned)

Dedicated cloud inference (A100 80GB)*

~$1.00

~$1.50/hr

* Cloud instance pricing varies by provider and region.

If you already own an Apple Silicon machine, local inference is free beyond electricity costs. For teams scaling to millions of tokens per month, the break-even point against cloud APIs arrives quickly — typically within the first 1-2 months of active use, depending on token volume.

Key insight: Edgee’s compression becomes relevant even in local-first setups — when you do route to the cloud, every token should count.

Privacy and Team Considerations

Local inference eliminates the data exposure risks of sending code, internal documents, or customer data to third-party APIs. For teams subject to SOC 2, GDPR, or HIPAA compliance, this is often the deciding factor.

Edgee’s Gateway supports bring-your-own-key and fallback model routing — if your local instance goes down, it routes to a cloud provider without interrupting the workflow. The Gateway also logs token usage and latency per team member, giving you observability without sacrificing privacy.

Hardware Requirements

Here’s what the Ollama + MLX stack demands in practice:

  • Minimum: Apple M2 Pro with 16 GB unified memory — runs 7B-8B models comfortably.
  • Recommended: Apple M4 Pro / M5 Pro with 24-36 GB — handles 14B-35B models with solid throughput.
  • Ideal: M5 Ultra with 64+ GB — runs 35B-72B models and supports larger context windows.

The key metric is unified memory, not just compute. MLX’s architecture means your model size is constrained by available memory. A 35B parameter model in 4-bit quantization uses roughly 18-20 GB, so 24 GB is the practical minimum for models in that range.

Local inference isn’t a silver bullet. Frontier models like GPT-5.4 and Claude Sonnet 4.6 still win on reasoning depth and breadth of knowledge. Setting up and maintaining a local stack also carries overhead that zero-ops API calls don’t — and an M5 Ultra with 64 GB isn’t cheap. The trade-off is cost and privacy versus capability: choose based on your workload, not ideology.

Local-First Is a Production Strategy

The 2026 toolchain for local LLMs has reached the maturity of cloud-based alternatives. Ollama handles deployment, MLX handles performance, and Edgee handles the cost-aware bridge to the cloud when you need it.

For teams adopting coding agents like Claude Code or OpenCode, the combination of ollama launch, NVFP4 quantization, and Edgee’s token compression creates a workflow that’s faster, cheaper, and more private than cloud-only alternatives. The local-first approach isn’t a compromise anymore — it’s a viable production strategy with clear advantages.

Further Reading

No comments yet

Live feed in your inbox

Track the tools. Lead the shift.

Tech leaders use Artificialus to stay ahead: editorial picks, agent comparisons, MCP updates, and signal-heavy analysis when it matters.

No spam. Only tools and shifts worth tracking.