EP 10

Pepper, Do You Know What This Is?

May 2026·7 min read·#architecture#triage#llm

Triage started as three categories.

small_talk

recall

action

Seemed like enough. It wasn't.

Three wasn't enough

"What's the weather today?" and "What's my average cost on GOOGL?" — both are information requests. But they need completely different handling.

Weather requires an external API call. Average cost requires searching the Vault. You can't put them in the same bucket.

"I need to submit the report by tomorrow" and "Send Eunsoo a notification" — both involve doing something. But one is a task I need to do, and the other is something Pepper executes directly. Different again.

So it expanded to six.

small_talk  — LLM knowledge is enough. "What should I eat today?"
recall      — pull from my Vault. "What's my GOOGL average cost?"
store       — write to Vault. "Save GOOGL at $245"
task        — register something for me to do. "Submit the report by tomorrow"
lookup      — real-time external info. "What's the weather today?"
capability  — Pepper executes directly. "Send Eunsoo a notification"
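
As a rough sketch, the six categories can be pinned down as a closed union type so that any label outside the list is rejected. The names `Intent`, `INTENTS`, and `isIntent` are illustrative assumptions, not taken from Pepper's codebase.

```typescript
// The six triage categories listed above, as a closed union type.
type Intent =
  | 'small_talk'
  | 'recall'
  | 'store'
  | 'task'
  | 'lookup'
  | 'capability';

// Runtime copy of the same list, used for validation.
const INTENTS: readonly Intent[] = [
  'small_talk',
  'recall',
  'store',
  'task',
  'lookup',
  'capability',
];

// Hypothetical type guard: true only for one of the six known labels.
function isIntent(value: unknown): value is Intent {
  return typeof value === 'string' && INTENTS.indexOf(value as Intent) !== -1;
}
```

With a guard like this, the old `action` label from the three-category version would no longer pass as valid.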

The edge cases are where it gets interesting

Clean categories would be ideal. Reality isn't clean.

"What's going on with North Korea lately?" Is that small_talk, because LLM training data covers it, or lookup, because it's a current event? It kind of reads like an opinion request too.

"Delete all my events this week." Is that a two-step process (lookup, then delete) or a single capability?

These cases were embedded directly into the prompt.

- "lately", "current", "recent" + external info → lookup
- delete / bulk modify: even if internal lookup is required → capability, one step (no multi-step split)
- sounds like asking for an opinion, but it's about an ongoing event → lookup

Handling edge cases was most of the prompt engineering work.
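
One way the rules above might be folded into a prompt is sketched here. The actual prompt wording is not shown in this post, so `buildTriagePrompt` and the surrounding instruction text are assumptions. Only the three rules come from the list above.

```typescript
// Edge-case rules, paraphrased from the list above.
const EDGE_CASE_RULES: string[] = [
  '"lately", "current", "recent" + external info -> lookup',
  'delete / bulk modify, even if an internal lookup is required -> capability, one step',
  'sounds like an opinion request but concerns an ongoing event -> lookup',
];

// Hypothetical prompt builder; the instruction wording is illustrative only.
function buildTriagePrompt(message: string): string {
  return [
    'Classify the user message into one or more of:',
    'small_talk, recall, store, task, lookup, capability.',
    'Edge-case rules:',
    ...EDGE_CASE_RULES.map((rule) => `- ${rule}`),
    'Reply with JSON: {"intents": [...], "confidence": 0..1}',
    `Message: ${JSON.stringify(message)}`,
  ].join('\n');
}
```

Keeping the rules in an array makes it cheap to add a new edge case whenever a misclassification turns up.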

Multi-step handling

"Tell me today's weather and the exchange rate" — two intents in one message.

A single intent comes back as an array of length one. Multiple intents are listed in order.

// single
{ intents: ['lookup'], confidence: 0.95 }

// multi
{ intents: ['lookup', 'lookup'], confidence: 0.85 }
// same intent twice still appears twice — never collapsed into one

// currently a stub
if (triageResult.intents.length > 1) {
  responseText = 'Handling multiple things at once is coming soon!'
}

Multi-step execution is still a stub. A planner is needed. For now it's "coming soon" in the backlog.
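
Before a planner exists, the result shape shown above can at least be validated so that malformed model output never reaches the dispatcher. This is a minimal sketch under that assumption. `parseTriageResult` is a hypothetical helper, and only the `intents` and `confidence` fields come from the examples above.

```typescript
type Intent = 'small_talk' | 'recall' | 'store' | 'task' | 'lookup' | 'capability';

interface TriageResult {
  intents: Intent[];
  confidence: number;
}

const VALID_INTENTS: readonly Intent[] = [
  'small_talk', 'recall', 'store', 'task', 'lookup', 'capability',
];

// Hypothetical parser: returns null instead of throwing on bad model output.
function parseTriageResult(raw: string): TriageResult | null {
  let data: unknown;
  try {
    data = JSON.parse(raw);
  } catch {
    return null; // not JSON at all
  }
  if (typeof data !== 'object' || data === null) return null;

  const { intents, confidence } = data as { intents?: unknown; confidence?: unknown };
  if (!Array.isArray(intents) || intents.length === 0) return null;
  if (typeof confidence !== 'number' || confidence < 0 || confidence > 1) return null;

  const checked: Intent[] = [];
  for (const item of intents) {
    if (typeof item !== 'string' || VALID_INTENTS.indexOf(item as Intent) === -1) {
      return null; // unknown label
    }
    checked.push(item as Intent); // duplicates are kept, never collapsed
  }
  return { intents: checked, confidence };
}
```

A `null` result could fall back to small_talk or ask the user to rephrase. Which one fits better is a product decision this sketch leaves open.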

Test suite: 50 cases, 96% pass rate

I wrote 50 test cases to verify that triage was classifying correctly.

T-001: "hi" → small_talk ✅
T-012: "What's my GOOGL average cost?" → recall ✅
T-023: "What's the weather today?" → lookup ✅
T-031: "Send Eunsoo a notification at 3pm" → capability ✅
T-044: "What are the latest AI trends?" → lookup ❌ (misclassified as small_talk)
T-048: "Summarize the KHH meeting and save it" → store+task ❌

48/50. 96%.

The two failures are genuine gray zones. "Latest AI trends" could go either way — LLM training data is often sufficient, but real-time context might matter. Acceptable for now.
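
A suite like this can be driven by a small runner that compares expected and actual intents and reports a pass rate. The sketch below assumes a synchronous `classify` function for simplicity, whereas the real triage call goes through an LLM and would be async. `TriageCase` and `runSuite` are hypothetical names.

```typescript
interface TriageCase {
  id: string;
  message: string;
  expected: string[]; // expected intents, in order
}

interface SuiteReport {
  passed: number;
  total: number;
  passRate: number; // fraction between 0 and 1
  failures: string[]; // ids of failing cases
}

// The classifier is injected so the runner works with a stub or the real thing.
function runSuite(
  cases: TriageCase[],
  classify: (message: string) => string[],
): SuiteReport {
  const failures: string[] = [];
  for (const c of cases) {
    const actual = classify(c.message);
    const ok =
      actual.length === c.expected.length &&
      actual.every((intent, i) => intent === c.expected[i]);
    if (!ok) failures.push(c.id);
  }
  const passed = cases.length - failures.length;
  const passRate = cases.length === 0 ? 0 : passed / cases.length;
  return { passed, total: cases.length, passRate, failures };
}
```

Reporting the failing ids alongside the rate makes it easier to see whether a prompt change fixed a gray-zone case or just moved the failure somewhere else.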

A structure for incorporating real feedback over time is already in the backlog. The 👎 reaction already writes user_feedback = -1 to pepper_logs. The data collection is there; the training loop just hasn't been built yet.
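
A first step toward that loop could be as small as pulling the thumbs-down rows out for review. In the sketch below only `user_feedback` comes from the post. The other fields on `PepperLog` and the helper `negativeFeedback` are assumptions about what a log row might contain.

```typescript
// Hypothetical shape of one pepper_logs row; only user_feedback is from the post.
interface PepperLog {
  message: string;
  intents: string[];
  user_feedback: number | null; // -1 for a thumbs-down, null if no reaction
}

// Collect the rows a user reacted to with a thumbs-down.
function negativeFeedback(logs: PepperLog[]): PepperLog[] {
  return logs.filter((log) => log.user_feedback === -1);
}
```

Each returned row is a candidate for a new test case once someone has decided what the correct intent should have been.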

Model cost splits here too

Triage itself runs on the cheapest model.

await generateWithFallback(['1A', '1C'], prompt, ...)
// 1A: Gemini Flash-Lite (cheapest)
// 1C: Claude Haiku (fallback)

There's no reason to use an expensive model just to classify intent. The heavier model only comes in for response generation that needs it: Sonnet for recall, while small talk stays on Flash-Lite. Each path gets the model it actually needs.
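
The fallback behavior can be sketched as a loop that tries each model in order and moves on when one fails. This is not the real `generateWithFallback`, whose remaining arguments are elided above. `generateWithFallbackSketch` and the injected `generate` callback are assumptions made so the example runs on its own.

```typescript
type ModelId = string;

// Hypothetical low-level call: resolves with the model's text or rejects on failure.
type Generate = (model: ModelId, prompt: string) => Promise<string>;

// Try each model in the given order; return the first successful response.
async function generateWithFallbackSketch(
  models: ModelId[],
  prompt: string,
  generate: Generate,
): Promise<string> {
  let lastError: unknown = new Error('no models configured');
  for (const model of models) {
    try {
      return await generate(model, prompt);
    } catch (error) {
      lastError = error; // remember the failure and fall through to the next model
    }
  }
  throw lastError; // every model failed
}
```

Called with `['1A', '1C']`, this would reach the Haiku fallback only when the Flash-Lite call rejects, so the cheap path stays the default.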

This is what the cost architecture from EP 01 looks like in practice. "Lightweight decisions use cheap models" — now it's running code.

With triage at six categories, Pepper is noticeably less confused.

What's next: lookup actually hitting external APIs, capability actually executing, and STATE B where Pepper writes its own code. That's the real finish line for Phase 0.