PMG: A Private Memory Graph for Persistent AI Memory

Yuting BAI


The Private Memory Graph (PMG) is a persistent memory architecture that represents conversation history as a graph. Rather than storing sessions as flat text to be searched, PMG encodes the relationships between conversation segments — both temporal and semantic — so that retrieval follows the structure of past thinking, not just the surface of past words.

Limitations of Summarization-Based Memory

The dominant approach to persistent AI memory is LLM-based summarization: at the end of a session, a language model compresses the conversation into a compact summary, which is stored and later injected into future prompts as context. While conceptually simple, this design carries several structural limitations.

Semantic and detail loss. Summarization is inherently lossy. The model decides at archive time what to preserve and what to discard — without knowledge of what future queries will need. Fine-grained details that appear unremarkable during summarization may prove critical in a later context. Once discarded, they cannot be recovered.

Computational overhead. Generating a high-quality summary requires a capable model. Routing every session through a cloud-based LLM introduces per-session inference cost that compounds rapidly at scale. For a personal assistant handling dozens of sessions per week, this is a non-trivial ongoing expense.

Local model inadequacy. Using a local small model eliminates API cost and external dependency, but introduces a different tradeoff: smaller models often produce summaries that are too compressed, semantically imprecise, or prone to hallucination. The detail-loss problem returns through a different mechanism, alongside the added burden of local model configuration and maintenance.

Privacy exposure. Any architecture that routes conversation history through an external API exposes sensitive personal data to third parties. For a personal AI assistant — which by nature accumulates intimate, contextually rich information — this is a meaningful and often underweighted concern.

PMG is designed to sidestep all four limitations:

| Property | Summarization-Based | PMG |
| --- | --- | --- |
| Storage fidelity | Lossy — detail discarded at archive time | Lossless — original text preserved verbatim |
| LLM dependency | Required for summarization | None — only embedding model + FAISS |
| Privacy | Data sent to external API | Fully local; no external calls required |
| Compute per session | High — full inference pass | Low — embedding + nearest-neighbor search |
| Memory portability | Varies by implementation | Self-contained files (text + SQLite + FAISS index); migrate by directory copy |
| Determinism | Stochastic (LLM output varies) | Deterministic — same query, same retrieval |

Two design decisions underpin these properties. First, PMG stores all conversation text verbatim. No information is discarded at archive time; relevance is determined at retrieval time, by the query. Second, graph construction and retrieval rely entirely on embedding models and vector search — operations that are fast, deterministic, and capable of running fully offline on consumer hardware.

The practical consequence is a memory system that is private by default, free of external dependencies, and portable to any machine by copying a self-contained directory.

Architecture Overview

PMG rests on two structural observations about conversation:

  1. Temporal continuity. Within a session, adjacent turns are almost always related. What was said two turns ago constrains what is relevant now. This continuity can be encoded as a bidirectional chain.

  2. Semantic continuity. Topics recur across sessions separated in time. A discussion about a project from last month may be highly relevant to a question today. This continuity can be encoded as weighted edges between semantically similar segments.

PMG layers both topologies onto a single graph. The result is a set of linear chains — one per session — interconnected by a web of cross-chain semantic edges.

```mermaid
flowchart TB
    subgraph SA["Session A"]
        direction LR
        A1[Seg A · 1] -->|chain| A2[Seg A · 2]
        A2 -->|chain| A3[Seg A · 3]
    end
    subgraph SB["Session B"]
        direction LR
        B1[Seg B · 1] -->|chain| B2[Seg B · 2]
    end
    A2 -.->|semantic| B1
    A3 -.->|semantic| B2
```

Solid edges are session chain edges (temporal). Dashed edges are semantic edges (cross-session).

Graph Construction

Graph construction runs automatically when a session closes. The pipeline proceeds in two parallel tracks after initial segmentation: one builds temporal edges, the other builds semantic edges.

```mermaid
flowchart TD
    A([Session Ends]) --> B[Archive to Disk]
    B --> C[Split into Segments\none per turn]
    C --> D1[Identify Adjacent Pairs\nwithin session]
    C --> E[Embed Segments\nBGE-m3 · 1024-dim]
    D1 --> D2[Create Chain Edges\nBidirectional · weight = 1.0]
    E --> F[FAISS k-NN Search\nk-nearest historical segments]
    F --> G{Similarity ≥ 0.6?}
    G -->|Yes| H[Add Semantic Edge\nweight = cosine score]
    G -->|No| I[Discard]
    D2 --> J([Persist: SQLite + FAISS])
    H --> J
```

Nodes: Conversation Segments

The atomic unit of PMG is the segment — a single conversation turn (one user message and its assistant response). Each segment carries:

  • A unique identifier encoding its session and position (seg_<session_id>_<index>)
  • The full turn text
  • A creation timestamp
  • A 1024-dimensional embedding vector (BGE-m3)

All storage, search, and retrieval in PMG operates at the segment level. Sessions themselves are not retrievable units; they are the source from which segments are derived.
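
A minimal Python sketch of this node structure (field and helper names are illustrative, not the system's actual schema):

```python
from dataclasses import dataclass

import numpy as np


@dataclass
class Segment:
    """One conversation turn: the atomic unit of PMG."""
    seg_id: str            # e.g. "seg_<session_id>_<index>"
    text: str              # full turn text: user message + assistant response
    created_at: float      # Unix timestamp
    embedding: np.ndarray  # 1024-dim BGE-m3 vector


def make_seg_id(session_id: str, index: int) -> str:
    # Identifier encodes session and position, per the format above.
    return f"seg_{session_id}_{index}"
```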

Edge Type 1: Session Chain

Within a session, consecutive segments are linked into a chain. These edges are bidirectional and carry a fixed weight of 1.0, reflecting the strong and unambiguous signal of conversational adjacency.

Session chain edges have a specific function during retrieval: they allow the system to expand around a retrieved segment, pulling in the turns immediately before and after it within its original session. A segment retrieved in isolation is often incomplete; its chain neighbors restore the discourse context that gives it meaning.
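
Chain-edge construction is mechanical. A sketch, assuming segment IDs arrive already ordered by position within the session (the helper name is hypothetical):

```python
def build_chain_edges(segment_ids: list[str]) -> list[tuple[str, str, float]]:
    """Link consecutive segments of one session with weight-1.0 edges.

    Edges are bidirectional, so each adjacent pair yields two entries.
    """
    edges = []
    for a, b in zip(segment_ids, segment_ids[1:]):
        edges.append((a, b, 1.0))  # forward: earlier -> later
        edges.append((b, a, 1.0))  # backward: later -> earlier
    return edges
```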

Edge Type 2: Semantic Edge

When a session closes, each of its segments is searched against the full FAISS index of historical segments. If the cosine similarity between a new segment and a historical segment meets the threshold (≥ 0.6), a directed semantic edge is created, pointing from the newer segment to the older one.

Semantic edges are capped at 20 per node, preventing high-degree segments from distorting graph traversal during retrieval. The similarity threshold (0.6) and edge cap are configurable hyperparameters; their optimal values depend on conversation domain and average session length.
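
The following sketch shows how semantic-edge creation could look under these rules. It assumes embeddings are L2-normalized so that FAISS inner-product search returns cosine similarity; function and variable names are illustrative:

```python
import faiss
import numpy as np

SIM_THRESHOLD = 0.6   # minimum cosine similarity for a semantic edge
MAX_EDGES = 20        # per-node cap on semantic edges


def semantic_edges_for(new_vecs: np.ndarray, new_ids: list[str],
                       index: faiss.IndexFlatIP,
                       historical_ids: list[str]) -> list[tuple[str, str, float]]:
    """Create directed semantic edges (newer -> older) for new segments.

    Assumes all embeddings are L2-normalized, so inner product equals
    cosine similarity.
    """
    scores, idxs = index.search(new_vecs.astype(np.float32), MAX_EDGES)
    edges = []
    for row, seg_id in enumerate(new_ids):
        for score, hist in zip(scores[row], idxs[row]):
            if hist == -1 or score < SIM_THRESHOLD:
                continue  # padding slot, or below the 0.6 threshold
            edges.append((seg_id, historical_ids[hist], float(score)))
    return edges
```

Note that searching with k equal to the edge cap enforces the 20-edge limit implicitly: a node can never receive more semantic neighbors than the search returns.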

Memory Retrieval

Retrieval translates a natural-language query into a structured context block through three sequential stages: entry point search, graph expansion, and context assembly.

```mermaid
flowchart TD
    A([Query]) --> B[Embed Query\nBGE-m3]
    B --> C[FAISS Search\nAll historical segments]
    C --> D[Apply Filters\nSimilarity ≥ 0.6 · Exclude current session\nDeduplicate against active context]
    D --> E[Entry Segments\nTop-k = 3]
    E --> F[Lateral Expansion\nSemantic edges · top-3 neighbors per entry]
    E --> G[Vertical Expansion\nChain edges · ±2 neighbors per entry]
    F --> H[Merge Pool\nDeduplicate by segment ID]
    G --> H
    H --> I[Sort Chronologically]
    I --> J([Inject into Prompt])
```

Stage 1: Entry Point Search

The query is embedded using the same BGE-m3 model as the segments. FAISS returns the approximate nearest neighbors across all historical segments. Three filters are applied before any segment enters the context:

  • Similarity floor. Only segments with cosine similarity ≥ 0.6 qualify as candidates.
  • Session exclusion. Segments from the current active session are excluded — retrieval should surface historical context, not echo the present conversation.
  • Active context deduplication. Candidates with similarity ≥ 0.85 to any turn already in the active context are dropped, so retrieval does not resurface material the model can already see.

The output is a small set of entry segments: the most relevant historical anchor points for the current query. The retrieval parameters — similarity floor (0.6), deduplication threshold (0.85), and entry count (top-k = 3) — are configurable and may be tuned to balance precision against recall for different use cases.
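
A sketch of this stage, under the same normalization assumption as before (the oversampling constant and all names are hypothetical):

```python
import numpy as np

SIM_FLOOR = 0.6    # minimum similarity for a retrieval candidate
DEDUP_SIM = 0.85   # drop candidates this close to the active context
TOP_K = 3          # number of entry segments
OVERSAMPLE = 50    # search more than TOP_K so filters have room to reject


def entry_point_search(query_vec, index, seg_ids, seg_sessions,
                       current_session, active_vecs):
    """Return up to TOP_K entry segment IDs for an embedded query.

    `seg_sessions[i]` is the session of index row i; `active_vecs` is a
    (m, 1024) array of embeddings for turns already in the context.
    """
    scores, idxs = index.search(
        query_vec.reshape(1, -1).astype(np.float32), OVERSAMPLE)
    entries = []
    for score, i in zip(scores[0], idxs[0]):
        if i == -1 or score < SIM_FLOOR:
            continue  # below the similarity floor
        if seg_sessions[i] == current_session:
            continue  # never echo the active session
        vec = index.reconstruct(int(i))
        if len(active_vecs) and float((active_vecs @ vec).max()) >= DEDUP_SIM:
            continue  # already represented in the active context
        entries.append(seg_ids[i])
        if len(entries) == TOP_K:
            break
    return entries
```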

Stage 2: Graph Expansion

Entry segments alone are rarely sufficient. A retrieved segment is often semantically complete but conversationally incomplete — missing what preceded it, or what it led to. Graph expansion addresses this by traversing the two edge types from each entry.

Lateral expansion follows semantic edges to retrieve thematically related segments from other sessions. This connects the current query to analogous past discussions, even those separated from the entry by weeks or months.

Vertical expansion follows session chain edges to retrieve the two segments immediately before and after each entry within its original session. This reconstructs the local conversational context — the exchange that surrounds a retrieved moment.

Both expansions run in parallel. Their outputs are merged into a single candidate pool.
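
A sketch of both expansions, assuming the graph is held as simple adjacency maps (names and structures are illustrative):

```python
def expand(entries, semantic_neighbors, chain_prev, chain_next):
    """Merge lateral (semantic) and vertical (chain) expansion results.

    `semantic_neighbors[seg]` lists neighbors sorted by edge weight;
    `chain_prev`/`chain_next` map a segment to its session neighbor.
    """
    pool = set(entries)
    for seg in entries:
        # Lateral: top-3 semantic neighbors from other sessions.
        pool.update(semantic_neighbors.get(seg, [])[:3])
        # Vertical: up to 2 chain neighbors in each direction.
        for step_map in (chain_prev, chain_next):
            node = seg
            for _ in range(2):
                node = step_map.get(node)
                if node is None:
                    break
                pool.add(node)
    return pool
```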

Stage 3: Context Assembly

The merged pool is deduplicated by segment ID and sorted in ascending order of creation timestamp. The resulting context block reads as a chronological timeline of relevant history, regardless of the order in which segments were discovered during retrieval.

Each segment in the block is labeled with its timestamp, source session, and retrieval role (entry or expanded neighbor). This metadata allows the model to reason about the recency and provenance of each memory, rather than treating all retrieved context as equally current.
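
A sketch of assembly, reusing the Segment structure from the node sketch above (the label format is illustrative):

```python
from datetime import datetime


def assemble_context(pool, segments, entry_ids):
    """Render the merged pool as a chronological context block.

    `segments` maps seg_id to a Segment; each block is labeled with
    timestamp, source session, and retrieval role.
    """
    ordered = sorted(pool, key=lambda sid: segments[sid].created_at)
    blocks = []
    for sid in ordered:
        seg = segments[sid]
        when = datetime.fromtimestamp(seg.created_at).isoformat()
        role = "entry" if sid in entry_ids else "neighbor"
        session = sid.split("_")[1]  # seg_<session_id>_<index>
        blocks.append(f"[{when} · session {session} · {role}]\n{seg.text}")
    return "\n\n".join(blocks)
```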

Discussion

A simpler design would use pure vector search: embed the query, retrieve top-k segments, inject them directly. PMG adds a graph layer for four reasons.

Contextual completeness. Vector search retrieves the most similar segments, not necessarily the most complete ones. A segment retrieved in isolation may be unintelligible without its conversational neighbors. Session chain expansion ensures that every retrieved memory arrives with enough surrounding context to be useful.

Associative breadth. Semantic edges allow retrieval to discover segments that are related to the query indirectly — through their connection to an entry segment. This single hop of graph traversal substantially broadens retrieved context without requiring a larger initial vector search, and without the precision cost of lowering the similarity threshold.

Transparency and auditability. Because every node in the graph stores raw conversation text, the memory store is fully inspectable. There are no opaque summaries, no lossy compressions, no representations that cannot be traced back to a specific conversation turn. This makes the system debuggable and, more importantly, trustworthy: users can know exactly what the assistant remembers and why.

Incremental construction. The graph is built session by session. Adding a new session requires only embedding its segments and searching against the existing index — no reprocessing of historical data. The graph grows incrementally, with each new session integrating naturally into the existing structure.

Together, these properties position PMG as more than a search index. It is a structured, lossless, and fully private representation of conversational memory — one that retrieves not isolated facts, but coherent fragments of past thinking, situated in time and connected by meaning.

Limitations

No architecture is without tradeoffs, and PMG is no exception.

Cold start. The graph is only as useful as the history it contains. For a new installation with no prior sessions, retrieval returns nothing — the system has no memory to draw from. Retrieval quality improves progressively as the graph accumulates sessions, but the first few interactions receive no benefit from the memory layer. This limitation can be partially mitigated by importing existing conversation history: platforms such as ChatGPT and Claude both support conversation export, and the exported data can be converted to plain text and ingested as seed sessions to bootstrap the graph from day one.
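
As an illustration of such bootstrapping, the sketch below assumes a deliberately simplified export format: a JSON list of conversations, each a list of {"role", "content"} messages. Real export schemas differ by platform and would need their own adapters:

```python
import json


def export_to_seed_sessions(path: str) -> list[list[str]]:
    """Convert a conversation export into seed-session turn texts.

    Each returned session is a list of turn strings, ready to be
    segmented and embedded like any native PMG session.
    """
    with open(path, encoding="utf-8") as f:
        conversations = json.load(f)
    sessions = []
    for conv in conversations:
        turns, pending = [], None
        for msg in conv:
            if msg["role"] == "user":
                pending = msg["content"]
            elif msg["role"] == "assistant" and pending is not None:
                # Pair each user message with its assistant reply: one turn.
                turns.append(f"User: {pending}\nAssistant: {msg['content']}")
                pending = None
        if turns:
            sessions.append(turns)
    return sessions
```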

Unbounded graph growth. PMG retains all historical segments indefinitely. There is no forgetting mechanism. Over time, the FAISS index and SQLite graph grow proportionally with usage. For long-running deployments with high conversation volume, this has implications for storage footprint and retrieval latency that must be managed — for example, by archiving or pruning segments beyond a configurable age threshold.
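
One possible pruning pass, sketched against a hypothetical SQLite schema (table and column names are assumptions, not the system's real schema; the FAISS index would be rebuilt from the surviving segments afterward):

```python
import sqlite3
import time

MAX_AGE_DAYS = 365  # hypothetical retention window


def prune_old_segments(db_path: str) -> list[str]:
    """Delete segments, and the edges touching them, beyond the age threshold.

    Returns the pruned IDs so the FAISS index can be rebuilt without them.
    """
    cutoff = time.time() - MAX_AGE_DAYS * 86400
    con = sqlite3.connect(db_path)
    pruned = [r[0] for r in con.execute(
        "SELECT seg_id FROM segments WHERE created_at < ?", (cutoff,))]
    if pruned:
        marks = ",".join("?" * len(pruned))
        con.execute(f"DELETE FROM edges WHERE src IN ({marks}) "
                    f"OR dst IN ({marks})", pruned + pruned)
        con.execute("DELETE FROM segments WHERE created_at < ?", (cutoff,))
    con.commit()
    con.close()
    return pruned
```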

Threshold sensitivity. The similarity thresholds that govern edge creation and retrieval filtering (0.6 for semantic edges, 0.6 for retrieval candidates, 0.85 for deduplication) are static hyperparameters. Their effectiveness depends on the embedding model's geometry and the semantic density of the conversation domain. A threshold well-calibrated for technical discussions may perform poorly on casual or highly varied conversations. Adaptive thresholding remains an open direction.


PMG is an original architecture developed as part of the author's research into persistent memory systems for AI assistants. All design decisions, parameter choices, and retrieval strategies described in this post reflect the current state of the working system.