Dataset
LongMemEval-S (ICLR 2025)
LongMemEval-S is an academic benchmark, accepted at ICLR 2025, for evaluating memory in long-term multi-session chat systems. Each entry contains roughly 53 conversation sessions flattened into a single haystack averaging 500 messages and 122,404 tokens. Purgr was run against 50 entries with an identical configuration (activeWindow: 8, anchorCount: 3, scorerMode: 'dmd') and a fresh Purgr instance per entry. This benchmark was designed for retrieval memory systems, not compression tools; the primary metrics here are therefore TRR and latency on real third-party conversational data at scale.
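The per-entry run configuration can be sketched as a config object. Only the three option values come from the text above; the `PurgrConfig` type name, field semantics, and structure are illustrative assumptions:

```typescript
// Hypothetical shape of the per-entry run configuration.
// Only the three values below come from the benchmark writeup;
// the type name and the meaning ascribed to each field are assumptions.
interface PurgrConfig {
  activeWindow: number;       // from the writeup: 8
  anchorCount: number;        // from the writeup: 3
  scorerMode: "dmd" | string; // from the writeup: 'dmd'
}

const benchmarkConfig: PurgrConfig = {
  activeWindow: 8,
  anchorCount: 3,
  scorerMode: "dmd",
};

console.log(benchmarkConfig.scorerMode); // "dmd"
```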
| Metric | Result |
| --- | --- |
| Entries tested | 50 of 500 |
| Mean original tokens | 122,404 |
| Mean compressed tokens | 22,882 |
| Mean TRR | 81.3% |
| Mean original messages | 501 |
| Mean compressed messages | 116 |
| Message reduction | 76.9% |
| Mean compression latency | 346ms |
| Min latency | 291ms |
| Max latency | 389ms |
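The headline numbers above are consistent with a simple reduction-rate definition (1 minus compressed over original). As a sanity check, the `reductionRate` helper below (ours, not part of Purgr) recomputes both figures from the mean counts in the table:

```typescript
// Fraction of the input removed: 1 - compressed / original.
function reductionRate(original: number, compressed: number): number {
  return 1 - compressed / original;
}

// Mean values from the results table above.
const tokenTRR = reductionRate(122_404, 22_882); // ≈ 0.813
const msgReduction = reductionRate(501, 116);    // ≈ 0.768

console.log((tokenTRR * 100).toFixed(1));     // "81.3" — matches the table
// "76.8" vs. 76.9% in the table: the table value is presumably the mean of
// per-entry ratios, which differs slightly from this ratio of mean counts.
console.log((msgReduction * 100).toFixed(1)); // "76.8"
```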
[Chart: TRR distribution across the 50 runs]
LongMemEval-S was designed to test retrieval memory systems: systems that explicitly store and index facts for later lookup. Purgr is a compression engine, not a retrieval system, so the relevant metrics here are TRR and latency: on average, Purgr reduces a 122K-token multi-session conversation to roughly 23K tokens in 346ms while preserving conversation structure. For structured fact preservation, see the NIAH and Competitive benchmarks above.
The survival rate of arbitrary conversational details on LongMemEval-S is 38%. This reflects Purgr's fundamental design: fact protection is triggered by high-specificity signals such as currency values, precise dates, version strings, and named persons with titles. General conversational details (pet names, food preferences, casual references) do not trigger protection. Purgr Semantic (on the roadmap) will address this use case via embedding-based importance scoring.
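The high-specificity triggers described above can be illustrated with a regex-based detector. This is a simplified sketch, not Purgr's actual implementation; the patterns and the `hasHighSpecificitySignal` name are assumptions based on the signal categories named in the text:

```typescript
// Illustrative high-specificity signals; Purgr's real detector is not shown
// in this document, so these patterns are assumptions based on the four
// categories it names: currency, precise dates, versions, titled names.
const SPECIFICITY_SIGNALS: RegExp[] = [
  /[$€£]\s?\d[\d,]*(\.\d+)?/,             // currency values: "$1,200.50"
  /\b\d{4}-\d{2}-\d{2}\b/,                // precise ISO dates: "2024-11-03"
  /\bv?\d+\.\d+\.\d+\b/,                  // version strings: "v2.3.1"
  /\b(Dr|Prof|Mr|Ms|Mrs)\.\s[A-Z][a-z]+/, // named persons with titles
];

function hasHighSpecificitySignal(message: string): boolean {
  return SPECIFICITY_SIGNALS.some((re) => re.test(message));
}

console.log(hasHighSpecificitySignal("Invoice total was $1,200.50")); // true
console.log(hasHighSpecificitySignal("My cat is named Whiskers"));    // false
```

The second example shows why a pet name survives only 38% of the time in aggregate: nothing in it matches a high-specificity pattern, so it competes for space on structural grounds alone.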
Citation
Dataset: LongMemEval-S (xiaowu0162/longmemeval-cleaned, Apache 2.0). Benchmark accepted at ICLR 2025. Citation: Wu et al., LongMemEval: Benchmarking Chat Assistants on Long-Term Interactive Memory, ICLR 2025.