EP 09

How to Give Pepper a Memory


May 2026 · 6 min read · #memory #context #embedding

LLMs have no memory.

Every request is stateless. Yesterday's conversation, a fact saved last week — none of it carries over by default.

So how can Pepper actually know our family?


Layer 1 — last 10 turns, injected directly

The most basic approach. The last 10 turns of conversation get passed into every request as part of the message array.

const N_HISTORY_TURNS = 10

// Fetch the 10 most recent messages from the DB
// and add them directly to the LLM messages array.
const historyMessages = [
  { role: 'user', content: 'that restaurant from before' },
  { role: 'assistant', content: 'You mean OO restaurant in Mangwon?' },
  // ...
]
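
A minimal sketch of that injection step, assuming history comes back oldest-first and that one "turn" means one stored message, as the comment above suggests. The `buildMessages` helper, the `ChatMessage` type, and the system-prompt slot are hypothetical names for illustration, not Pepper's actual code.

```typescript
type Role = 'system' | 'user' | 'assistant'

interface ChatMessage {
  role: Role
  content: string
}

const N_HISTORY_TURNS = 10

// Hypothetical helper: `history` is assumed to be ordered oldest-first.
function buildMessages(
  systemPrompt: string,
  history: ChatMessage[],
  current: string,
): ChatMessage[] {
  // Keep only the newest N messages; anything older is dropped here.
  const recent = history.slice(-N_HISTORY_TURNS)
  return [
    { role: 'system', content: systemPrompt },
    ...recent,
    { role: 'user', content: current },
  ]
}
```

With 25 stored messages, the model sees only the last 10 plus the new one. The other 15 never reach it.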

"Do you remember what I said before?" — this handles that. But anything before 10 turns is invisible. As conversations grow, context gets cut off.


Layer 2 — 60 messages, summarized and injected

So a second layer was added.

Fetch up to 60 messages. The most recent 10 go to Layer 1 directly. The remaining messages (up to 50) get summarized by Flash-Lite and injected separately.

const TOTAL_FETCH   = 60
const RECENT_WINDOW = 10

const older = messages.slice(0, -RECENT_WINDOW)  // everything before the recent 10 (up to 50); assumes oldest-first order
// → summarized into 3-5 lines by Flash-Lite
// → "Earlier conversation summary: discussed buying GOOGL at $245.
//    Eunsoo's tutoring schedule came up..."

This summary gets passed in with every request. Even things from before the 10-turn window are compressed and available.
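
The split can be sketched a little more fully, including the case where fewer than 10 messages exist. This is an illustration under assumptions: `splitHistory`, `buildSummaryPrompt`, the `StoredMessage` type, and the prompt wording are all hypothetical, and messages are assumed to be ordered oldest-first.

```typescript
interface StoredMessage {
  role: 'user' | 'assistant'
  content: string
}

const TOTAL_FETCH = 60
const RECENT_WINDOW = 10

// Assumes `messages` is ordered oldest-first.
function splitHistory(messages: StoredMessage[]): {
  older: StoredMessage[]
  recent: StoredMessage[]
} {
  const fetched = messages.slice(-TOTAL_FETCH)
  return {
    // Empty when there are 10 or fewer messages: slice(0, -10) clamps to [].
    older: fetched.slice(0, -RECENT_WINDOW),
    recent: fetched.slice(-RECENT_WINDOW),
  }
}

// Hypothetical prompt wording; the real summarizer prompt is not shown in this post.
function buildSummaryPrompt(older: StoredMessage[]): string {
  const transcript = older.map((m) => `${m.role}: ${m.content}`).join('\n')
  return `Summarize the conversation below in 3-5 lines.\n\n${transcript}`
}
```

The prompt goes to the summarizer model, and its reply becomes the summary that Layer 2 injects.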

Generating it fresh every time would be slow and expensive. So it's cached.

// in-memory Map — 0ms access within the server process
const summaryMap = new Map<string, CacheEntry>()

// on each request: check Map → return immediately if hit
// on miss: pull from DB and populate Map (cold start, once)
// in background: refresh every 5 minutes (never blocks the main response)

scheduleRefresh() is fire-and-forget. Even when a new summary is needed, the user doesn't wait. The refreshed version kicks in on the next request.
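
Here is one way the serve-stale, refresh-in-background pattern could look. It is a simplified sketch, not the actual implementation: the `Summarizer` callback, the `inFlight` set, and the shape of `CacheEntry` are assumptions. It also refreshes lazily when a stale entry is read rather than on a timer, and it skips the DB fallback on a cold miss by returning `null`.

```typescript
interface CacheEntry {
  summary: string
  updatedAt: number // epoch milliseconds
}

const REFRESH_INTERVAL_MS = 5 * 60 * 1000
const summaryMap = new Map<string, CacheEntry>()
const inFlight = new Set<string>() // conversations with a refresh already running

// Hypothetical: whatever produces a fresh summary (DB read + summarizer call).
type Summarizer = (conversationId: string) => Promise<string>

function scheduleRefresh(id: string, summarize: Summarizer): void {
  if (inFlight.has(id)) return // avoid piling up duplicate refreshes
  inFlight.add(id)
  // Deliberately not awaited: the caller never waits on this promise.
  void summarize(id)
    .then((summary) => {
      summaryMap.set(id, { summary, updatedAt: Date.now() })
    })
    .catch((err) => {
      console.error('[summary] refresh failed:', err)
    })
    .then(() => {
      inFlight.delete(id)
    })
}

function getSummary(
  id: string,
  summarize: Summarizer,
  now: number = Date.now(),
): string | null {
  const entry = summaryMap.get(id)
  if (!entry || now - entry.updatedAt > REFRESH_INTERVAL_MS) {
    scheduleRefresh(id, summarize)
  }
  // Return whatever is cached right now, even if it is stale.
  return entry ? entry.summary : null
}
```

The trade-off is that a request can see a summary that is a few minutes old. For a family assistant, that is usually a better deal than making every message wait on a summarizer call.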


Layer 3 — embedding-based Vault search

The third layer is different. Not conversation history — stored information.

"What's my average cost on GOOGL?" requires finding the right record in the Vault. Text search has limits. The relevant entry might not contain the word "average cost" at all.

Embeddings handle this. The query is turned into a 768-dimensional vector. All Vault items are stored as vectors too. Cosine similarity finds the closest matches.

const RECALL_LIMIT = 5
const SIMILARITY_THRESHOLD = 0.65

// 1. query → 768-dimensional vector
const queryVector = await embed(query)

// 2. pgvector cosine similarity search
const results = await db.rpc('vault_semantic_search', {
  query_embedding: queryVector,
  similarity_threshold: SIMILARITY_THRESHOLD,
  match_limit: RECALL_LIMIT,
})
// → up to 5 relevant Vault items
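
To make the similarity step concrete, here is a small in-memory version of what the database search does conceptually. It is only an illustration: in Pepper the ranking happens inside the `vault_semantic_search` RPC, whose internals are not shown here. The `VaultItem` type and `topMatches` helper are hypothetical, and whether the real threshold is inclusive is an assumption. (pgvector's `<=>` operator returns cosine distance, which is 1 minus the similarity computed below.)

```typescript
interface VaultItem {
  id: string
  embedding: number[]
}

// Cosine similarity: dot(a, b) / (|a| * |b|), ranging from -1 to 1.
function cosineSimilarity(a: number[], b: number[]): number {
  if (a.length !== b.length) throw new Error('vector length mismatch')
  let dot = 0
  let normA = 0
  let normB = 0
  for (let i = 0; i < a.length; i++) {
    const x = a[i] ?? 0
    const y = b[i] ?? 0
    dot += x * y
    normA += x * x
    normB += y * y
  }
  if (normA === 0 || normB === 0) return 0 // a zero vector has no direction
  return dot / (Math.sqrt(normA) * Math.sqrt(normB))
}

// Hypothetical helper mirroring the threshold + limit parameters above.
function topMatches(
  query: number[],
  items: VaultItem[],
  threshold = 0.65,
  limit = 5,
): { id: string; similarity: number }[] {
  return items
    .map((item) => ({ id: item.id, similarity: cosineSimilarity(query, item.embedding) }))
    .filter((m) => m.similarity >= threshold)
    .sort((x, y) => y.similarity - x.similarity)
    .slice(0, limit)
}
```

The threshold matters as much as the limit: without it, the search always returns five items, even when none of them is actually related to the question.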

Early on, "There's no saved information on your GOOGL average cost" kept appearing. It turned out to be an embedding model issue: text-embedding-004 was returning 404 on this API key, and the .catch(() => null) in the code was swallowing the error silently. Took a while to find.
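
A hedged sketch of a less silent fallback: still return null so the response goes out without Vault results, but log the failure first. `embedOrNull`, `EmbedFn`, and `Logger` are hypothetical names, and the embedding call is injected so the sketch does not assume any particular SDK.

```typescript
type EmbedFn = (text: string) => Promise<number[]>
type Logger = (message: string, error: unknown) => void

// Same fallback behavior as `.catch(() => null)`, but the failure is recorded.
async function embedOrNull(
  embed: EmbedFn,
  text: string,
  log: Logger = console.error,
): Promise<number[] | null> {
  try {
    return await embed(text)
  } catch (err) {
    log('[embed] request failed, skipping Vault recall:', err)
    return null
  }
}
```

With this in place, a 404 from the embedding endpoint shows up in the server log on the first failed recall instead of surfacing as a confusing "no saved information" reply.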


When all three layers stack

By the time Pepper generates a response, all of this is in context:

  • Pepper's personality and family info (persona.md)
  • Earlier conversation summary (Layer 2)
  • Last 10 turns of conversation (Layer 1)
  • Relevant Vault memories (Layer 3, only on recall intent)
  • The current message
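
The text parts of that stack could be assembled along these lines. This is a sketch under assumptions: `buildSystemPrompt` and `PromptParts` are hypothetical, and folding the summary and Vault memories into a single system prompt is one possible placement, not necessarily how Pepper orders them. The recent turns and the current message would follow as ordinary chat messages.

```typescript
interface PromptParts {
  persona: string // contents of persona.md
  summary: string | null // Layer 2, null when nothing is cached yet
  vaultMemories: string[] // Layer 3 results, already rendered as text
  recallIntent: boolean // whether this message was classified as a recall request
}

function buildSystemPrompt(parts: PromptParts): string {
  const sections: string[] = [parts.persona]
  if (parts.summary) {
    sections.push(`Earlier conversation summary:\n${parts.summary}`)
  }
  // Vault memories are included only on recall intent, as in the list above.
  if (parts.recallIntent && parts.vaultMemories.length > 0) {
    const list = parts.vaultMemories.map((m) => `- ${m}`).join('\n')
    sections.push(`Relevant Vault memories:\n${list}`)
  }
  return sections.join('\n\n')
}
```

Empty layers are skipped entirely, so a brand-new conversation with no summary and no recall intent sends just the persona.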

LLMs have no memory. But you can build memory and hand it to them each time. Creating the illusion of memory — that's what Pepper's memory architecture actually is.