From 2480ms to 815ms
Let's start with the numbers.
Time from user message to Pepper's response. It started at 2480ms. Now it's 815ms. A 67% reduction.
But the real subject of this episode isn't the number.
The bottleneck wasn't the LLM
When it felt slow, I assumed it was the LLM — tried different models, trimmed prompts.
Then I actually measured.
DB INSERT:  241ms  (start logging)
LLM call:   481ms  (actual AI generation)
DB UPDATE:  741ms  (finish logging)
─────────────────
Total:     1463ms
The LLM was 481ms. The database writes were 982ms. Logging cost twice as much as thinking.
The issue was the structure. Wait for INSERT, run the LLM, wait for UPDATE — everything queued in sequence.
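A minimal sketch of how a per-step measurement like this could be taken. The `timed` and `summarize` helpers and the label names are hypothetical, not taken from the actual codebase; only the three durations come from the measurement above.

```typescript
type Timings = Record<string, number>;

// Wraps one async step and records how long it took, in milliseconds.
async function timed<T>(
  label: string,
  timings: Timings,
  step: () => Promise<T>
): Promise<T> {
  const start = Date.now();
  try {
    return await step();
  } finally {
    timings[label] = Date.now() - start;
  }
}

// Adds up the recorded steps and names the slowest one.
function summarize(timings: Timings): { total: number; slowest: string } {
  let total = 0;
  let slowest = "";
  let slowestMs = -1;
  for (const [label, ms] of Object.entries(timings)) {
    total += ms;
    if (ms > slowestMs) {
      slowest = label;
      slowestMs = ms;
    }
  }
  return { total, slowest };
}

// The figures measured above, fed back through the summary.
const measured: Timings = { dbInsert: 241, llmCall: 481, dbUpdate: 741 };
const summary = summarize(measured);
// summary.total is 1463 and summary.slowest is "dbUpdate"
```

Wrapping each stage as `await timed("llmCall", timings, () => llm.call())` leaves the request flow unchanged while showing where the time goes.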
Parallel + fire-and-forget
// Before: DB blocks both ends of the LLM call
const log = await db.insert(...)     // 241ms wait
const result = await llm.call()      // 481ms wait
await db.update(log.id, ...)         // 741ms wait
// total: ~1463ms

// After: parallel + fire-and-forget
const insertPromise = db.insert(...) // no await, runs alongside LLM
const result = await llm.call()      // 481ms (the real bottleneck)
insertPromise.then(({ data }) => {
  db.update(data.id, ...).catch(() => null) // background, after response
})
// total: ~481ms
INSERT starts at the same time as the LLM call, no await. UPDATE happens in the background after the response is already sent. The log finishes a little later — that's fine. The user doesn't need to wait for it.
Logging isn't the critical path. But it was blocking the critical path.
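One thing to watch with fire-and-forget is what happens when a background write fails. Here is a self-contained sketch of the same pattern with that case handled. `Deps`, `handleMessage`, and `onLogError` are hypothetical stand-ins for the real clients, and the sketch assumes the insert resolves to a row id and signals failure by rejecting.

```typescript
// All names here are hypothetical stand-ins for the real db and llm clients.
interface Deps {
  insertLog: () => Promise<number>; // resolves to the new log row id
  callLlm: () => Promise<string>; // resolves to the reply text
  updateLog: (id: number, reply: string) => Promise<void>;
  onLogError: (err: unknown) => void; // e.g. report to an error tracker
}

async function handleMessage(deps: Deps): Promise<string> {
  // Start the insert without awaiting it. The catch is attached right away,
  // so a failed insert is reported instead of becoming an unhandled rejection.
  const insertPromise: Promise<number | null> = deps
    .insertLog()
    .catch((err: unknown) => {
      deps.onLogError(err);
      return null;
    });

  // The only await on the critical path.
  const reply = await deps.callLlm();

  // Finish the log in the background; the caller gets the reply immediately.
  void insertPromise
    .then((id) => (id === null ? undefined : deps.updateLog(id, reply)))
    .catch(deps.onLogError);

  return reply;
}
```

The design choice is that a logging failure is reported but never reaches the user: the reply is returned whether or not either write succeeds.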
That alone wasn't enough
815ms. Still felt slow.
The reason 0.8 seconds feels long: there's no feedback. Press send and the screen doesn't move — even 300ms feels like something is broken.
Two things changed that.
One: the message appears on screen the moment you send it. No waiting for the server. A clock icon while it's in flight, then a checkmark on confirmation.
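A rough sketch of that optimistic send as plain state updates. The type and function names are hypothetical, and the `failed` status is an assumption: the app as described only has the clock and checkmark states.

```typescript
// "failed" is an assumption; the text only describes the clock and checkmark.
type Status = "pending" | "sent" | "failed";

interface LocalMessage {
  localId: string; // hypothetical client-generated id
  text: string;
  status: Status;
}

// Show the message immediately, before the server has answered.
function addOptimistic(
  messages: LocalMessage[],
  localId: string,
  text: string
): LocalMessage[] {
  return [...messages, { localId, text, status: "pending" }];
}

// Flip the status once the server confirms or rejects the send.
function settle(
  messages: LocalMessage[],
  localId: string,
  ok: boolean
): LocalMessage[] {
  return messages.map((m) =>
    m.localId === localId ? { ...m, status: ok ? "sent" : "failed" } : m
  );
}

// Map a status to the icon the UI would draw.
function iconFor(status: Status): string {
  switch (status) {
    case "pending":
      return "clock";
    case "sent":
      return "check";
    case "failed":
      return "retry";
  }
}
```

The list is updated on tap with `addOptimistic`, and `settle` runs whenever the server answers, so the screen never waits on the network.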
Two: while Pepper is thinking, a "..." typing indicator appears immediately. Pepper seems to be "typing" before the response actually exists.
// Insert a typing row before LLM generation even starts
const typingRow = await db.from('chat_messages').insert({
  is_typing: true,
  content: null,
  // ...
})
// After LLM completes, UPDATE the same row with the actual content
The mobile app subscribes to this row via Realtime. When is_typing flips to false and content appears, the dots are replaced with the real message.
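On the client, the swap from dots to text can be sketched as two small functions that run whenever a row change arrives, however the subscription is wired up. The `is_typing` and `content` fields are from the snippet above; the `id` field and the function names are assumptions.

```typescript
interface ChatRow {
  id: string; // assumed primary key
  is_typing: boolean;
  content: string | null;
}

// Merge a changed row into local state, replacing any row with the same id.
function applyRowChange(rows: ChatRow[], changed: ChatRow): ChatRow[] {
  const exists = rows.some((r) => r.id === changed.id);
  return exists
    ? rows.map((r) => (r.id === changed.id ? changed : r))
    : [...rows, changed];
}

// Decide what the bubble shows: dots while typing, the content once it lands.
function bubbleText(row: ChatRow): string {
  return row.is_typing || row.content === null ? "..." : row.content;
}
```

Because the typing row and the final message share one id, the update replaces the bubble in place instead of adding a second one.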
The actual speed is still 0.8 seconds. But the experience feels different. When something is visibly happening, waiting is tolerable.
Being fast and feeling fast are different things — and sometimes feeling fast matters more.
The real bottleneck now is LLM call time. That's irreducible for the moment. But what users see while they wait is something I can control.