FRIDAY Phase 4: Arena Routing
Multi-model bidding · Cost-aware brain · Budget guard · Scheduler · Dashboard. All shipped this week. Phase 5 — voice + browser worker — next.
For most of 2025 I had one rule for FRIDAY: the AI on my laptop should never feel less capable than the AI on someone else's server. That meant beating GPT-4 not on raw capability — that was never the goal — but on task-fit per dollar. Phase 4 is where that actually started working.
The problem: a default model is a tax
Every "agent framework" I tried defaulted to one model. Claude 3.5. GPT-4o. Llama 3 70B. The default was always good enough. But "good enough" for a 200-token reply about my calendar costs the same as "good enough" for a 4,000-token codebase audit.
Worse: half the requests routed to the default were trivial. Renaming a file. Reading a setting. Stuff a 7B local model handles fine. I was paying frontier-model prices for tasks a Raspberry Pi could finish.
The hypothesis: if FRIDAY can decide which model before it commits a single token, the average task cost drops by an order of magnitude — without quality drop on the tasks that actually need a frontier model.
What "Arena" actually does
The Arena module sits in front of the orchestrator. When a task arrives, it doesn't pick a model. It runs an auction.
Each registered model — local Ollama checkpoints, Claude API, OpenAI, Gemini Flash — submits a bid. A bid is a tuple:
Bid(
model_id: str,
estimated_tokens_in: int,
estimated_tokens_out: int,
estimated_cost_usd: float,
self_reported_confidence: float, # 0.0 – 1.0
latency_p50_ms: int,
)
The Arena scorer multiplies confidence by a domain-fit weight (read from telemetry — did this model historically nail this task class?) and divides by cost. The highest score wins.
The cost-aware brain
The bidding alone wasn't enough. Models lie about their confidence — that's a known issue. So Phase 4 added a budget guard that sits above the Arena.
The budget guard does three things:
- Caps each task at a per-task ceiling (configurable: I run $0.01).
- If the winning bid exceeds the ceiling, it forces a re-auction with the frontier excluded — local models bid again with a relaxed confidence threshold.
- If a task fails, the next attempt costs go on a daily rolling budget. Hit the daily ceiling, the next 24 hours route local-only.
This means a runaway agent can't drain my Anthropic balance overnight. Two weeks in, the daily ceiling has been hit zero times.
Scheduler + dashboard
Phase 4 also shipped two unsexy but load-bearing modules:
- Scheduler — cron-style, but tasks are stored in SQLite with idempotency keys. A task that runs every 5 minutes will not fan out into a queue if the previous run hasn't completed.
- Dashboard — a tiny localhost web view. Every Arena auction, every budget hit, every task outcome. Searchable. The first time I opened it I found a bug in my prompt builder I'd been ignoring for weeks because nothing surfaced it.
The dashboard was the thing that made me trust Arena. Until you can see the auctions, you assume the brain is doing something dumb. Once you see them, you trust the system more than you trust your own gut on routing.
What's next: Phase 5
The voice loop and browser worker are the last two interfaces FRIDAY needs to be a complete personal agent. Both have safety implications I'm taking my time on — a voice agent with command authority and a browser worker that can fill out forms is exactly the kind of system that needs shadow-sim before it goes live.
Memory graph, Arena, budget guard, scheduler, dashboard — these are the boring parts. Phase 5 is where the magic returns. Soon.
FRIDAY is local-first by construction. Runs on M5 Pro 48GB via Ollama. No code leaves the laptop unless I explicitly route to a hosted model — and even then, the routing decision is logged, attributable, and capped. Source pending public release after Phase 5.