Your vector database is growing. Is it getting smarter?

Stop guessing at your corpus health. Corpulse plugs into the stack you already run, combining Postgres-backed analytics with Qdrant retrieval instrumentation to surface ghosts, duplicates, obsolete versions, stale embeddings, and low-engagement suspects.

Terminal
$ pip install "corpulse[qdrant]"

The Problem

Vector Database Entropy

As you add more data to your RAG application, retrieval quality doesn't just stay the same. It degrades silently. This is Vector Database Entropy.

  • Ghost Documents. Forgotten documents that linger in the index long after anything retrieves them, adding cost and clutter without contributing to answers.
  • Stale Chunks. Old versions of data that compete with fresh information, causing your agent to provide outdated or conflicting answers.
  • Retrieval Drift. Subtle changes in your embedding space that slowly pull your retrieval accuracy down as the corpus grows.

Our Philosophy

Fitness tracker, not a grade

Static evaluation tools give you a grade on a specific test set. Corpulse is different.

Think of us as a fitness tracker for your corpus. We don't just tell you if you passed today's test; we provide continuous health signals from real-world usage. You keep your own vector database and application flow, while Corpulse adds observability around indexing, retrieval, engagement, freshness, and drift.

"Because a static score is just a snapshot. Health is a trajectory."

What it Measures

Corpulse turns registered documents, retrieval logs, and engagement events into health signals for the problems that affect precision, freshness, and cost.

Ghosts

Registered documents with no retrieval activity in the last 30 days, often pointing at dead or forgotten content.
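
To make the rule concrete, here is a minimal sketch of the ghost check in plain Python. It is an illustration of the definition above, not Corpulse's implementation: the function name find_ghosts and the shape of the inputs (a list of registered IDs plus a dict of last-retrieval timestamps) are assumptions made for the example.

```python
from datetime import datetime, timedelta, timezone

GHOST_WINDOW_DAYS = 30  # the 30-day window stated above


def find_ghosts(registered_ids, last_retrieved_at, now, window_days=GHOST_WINDOW_DAYS):
    """Return registered doc IDs with no retrieval inside the window.

    Hypothetical helper: `last_retrieved_at` maps doc_id -> datetime of the
    most recent retrieval; a missing key means the doc was never retrieved.
    """
    cutoff = now - timedelta(days=window_days)
    ghosts = []
    for doc_id in registered_ids:
        last_seen = last_retrieved_at.get(doc_id)
        if last_seen is None or last_seen < cutoff:
            ghosts.append(doc_id)
    return ghosts


now = datetime(2024, 6, 30, tzinfo=timezone.utc)
last_retrieved_at = {
    "doc-a": now - timedelta(days=2),   # retrieved recently
    "doc-b": now - timedelta(days=45),  # last retrieved outside the window
}
ghosts = find_ghosts(["doc-a", "doc-b", "doc-c"], last_retrieved_at, now)
print(ghosts)  # ['doc-b', 'doc-c']
```

Note that a document that was never retrieved at all ("doc-c") counts as a ghost in this sketch, which seems the natural reading of "no retrieval activity".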

Duplicates

Near-duplicate documents above the default 0.92 similarity threshold that waste context and confuse ranking.
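
As a rough sketch of what a threshold check like this involves, the example below compares every pair of document embeddings by cosine similarity and keeps pairs above 0.92. The helper names, the toy three-dimensional vectors, and the brute-force pairwise loop are assumptions for illustration; how Corpulse actually computes or stores similarities is not described here.

```python
import math
from itertools import combinations

DEFAULT_THRESHOLD = 0.92  # default similarity threshold stated above


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def find_duplicate_pairs(embeddings, threshold=DEFAULT_THRESHOLD):
    """Return (id_a, id_b, score) for every pair scoring above the threshold.

    Hypothetical helper: `embeddings` maps doc_id -> list of floats.
    """
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine_similarity(vec_a, vec_b)
        if score > threshold:
            pairs.append((id_a, id_b, score))
    return pairs


embeddings = {
    "faq-v1": [1.0, 0.0, 0.1],
    "faq-copy": [1.0, 0.05, 0.1],  # nearly identical to faq-v1
    "pricing": [0.0, 1.0, 0.0],    # unrelated direction
}
duplicate_pairs = find_duplicate_pairs(embeddings)
print([(a, b) for a, b, _ in duplicate_pairs])  # [('faq-copy', 'faq-v1')]
```

The pairwise loop is quadratic in the number of documents, so it only suits small corpora; at scale the same check would normally lean on the vector database's own nearest-neighbor search.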

Obsolete

Older versioned files superseded by newer ones, such as v1 versus v2 of the same spec or policy.
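
One way to picture this check is to parse a version suffix out of each filename and flag everything below the highest version in its group. The sketch below assumes a naming scheme like policy_v1.pdf or spec-v3.md; the regular expression and the find_obsolete helper are hypothetical and may not match the patterns Corpulse recognizes.

```python
import re
from collections import defaultdict

# Assumed naming scheme: "<stem>_v<N><.ext>", with "_", "-", or a space before "v".
VERSION_PATTERN = re.compile(
    r"^(?P<stem>.+?)[ _-]v(?P<version>\d+)(?P<ext>\.\w+)?$", re.IGNORECASE
)


def find_obsolete(filenames):
    """Return filenames that have a higher-numbered sibling with the same stem."""
    groups = defaultdict(list)
    for name in filenames:
        match = VERSION_PATTERN.match(name)
        if match is None:
            continue  # unversioned files are never marked obsolete here
        key = (match["stem"].lower(), (match["ext"] or "").lower())
        groups[key].append((int(match["version"]), name))

    obsolete = []
    for versions in groups.values():
        latest = max(version for version, _ in versions)
        obsolete.extend(name for version, name in sorted(versions) if version < latest)
    return obsolete


files = ["policy_v1.pdf", "policy_v2.pdf", "policy_v10.pdf", "manual.pdf", "spec-v3.md"]
obsolete = find_obsolete(files)
print(obsolete)  # ['policy_v1.pdf', 'policy_v2.pdf']
```

Versions are compared as integers, so v10 correctly supersedes v2 rather than sorting before it as a string would.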

Stale

Documents whose source changed after embedding and are now more than 14 days behind the source of truth.
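
The staleness rule comes down to comparing two timestamps per document. The sketch below uses the source_updated_at and embedded_at field names that appear in the method reference on this page; the dict-based record shape and the find_stale helper are assumptions for illustration only.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER_DAYS = 14  # the 14-day lag stated above


def find_stale(docs, stale_after_days=STALE_AFTER_DAYS):
    """Return doc IDs whose source changed more than N days after embedding.

    Hypothetical helper: each doc is a dict with "doc_id",
    "source_updated_at", and "embedded_at" (timezone-aware datetimes).
    """
    threshold = timedelta(days=stale_after_days)
    return [
        doc["doc_id"]
        for doc in docs
        if doc["source_updated_at"] - doc["embedded_at"] > threshold
    ]


embedded = datetime(2024, 5, 1, tzinfo=timezone.utc)
docs = [
    {"doc_id": "fresh", "embedded_at": embedded,
     "source_updated_at": embedded + timedelta(days=3)},
    {"doc_id": "stale", "embedded_at": embedded,
     "source_updated_at": embedded + timedelta(days=20)},
    {"doc_id": "unchanged", "embedded_at": embedded,
     "source_updated_at": embedded - timedelta(days=90)},
]
stale_ids = find_stale(docs)
print(stale_ids)  # ['stale']
```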

Suspects

Documents retrieved at least 5 times in the last 30 days but engaged with less than 15% of the time.
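
In code, the suspect rule is a ratio with a minimum sample size. The sketch below assumes engagement rate means engaged retrievals divided by total retrievals within the window; the find_suspects helper and the (retrievals, engagements) tuple format are hypothetical, and the counts are made up for the example.

```python
MIN_RETRIEVALS = 5          # minimum retrievals stated above
MAX_ENGAGEMENT_RATE = 0.15  # the 15% engagement cutoff stated above


def find_suspects(stats, min_retrievals=MIN_RETRIEVALS,
                  max_engagement_rate=MAX_ENGAGEMENT_RATE):
    """Return (doc_id, engagement_rate) for often-retrieved, rarely-used docs.

    Hypothetical helper: `stats` maps doc_id -> (retrievals, engagements),
    both counted over the same 30-day window.
    """
    suspects = []
    for doc_id, (retrievals, engagements) in stats.items():
        if retrievals < min_retrievals:
            continue  # too few retrievals to judge
        rate = engagements / retrievals
        if rate < max_engagement_rate:
            suspects.append((doc_id, rate))
    return suspects


stats = {
    "doc-a": (40, 2),   # 5% engagement: suspect
    "doc-b": (40, 12),  # 30% engagement: healthy
    "doc-c": (3, 0),    # below the retrieval minimum: not judged
}
suspects = find_suspects(stats)
print([doc_id for doc_id, _ in suspects])  # ['doc-a']
```

The retrieval minimum keeps a document with one unlucky retrieval from being flagged on too little evidence.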

Integrate without replacing your stack.

Use AsyncPostgresBackend for analytics storage, add the Qdrant wrapper for automatic retrieval logging, and keep explicit control over registration and engagement events.

import asyncio

from corpulse import AsyncCorpulse
from corpulse.backends import AsyncPostgresBackend


async def main():
    # Postgres stores the analytics; your vector database stays where it is.
    backend = await AsyncPostgresBackend.create(
        "postgresql://user:pass@localhost/corpulse"
    )
    corpulse = AsyncCorpulse(backend=backend)

    # Register a document, then log a retrieval and an engagement event for it.
    await corpulse.register_document("doc-123", "manual.pdf")
    await corpulse.log_retrieval(
        [{"doc_id": "doc-123", "filename": "manual.pdf", "score": 0.91}],
        query="How do I set this up?",
    )
    await corpulse.log_engagement("doc-123", event="opened")


asyncio.run(main())

Analysis Methods

Explicit APIs for registration, retrieval analytics, engagement tracking, and health reporting.

  • get_ghosts(). Returns registered documents with no retrievals inside the 30-day ghost window. Input: None. Output: List[GhostItem].
  • get_duplicates(). Finds near-duplicate document pairs by cosine similarity using a 0.92 default threshold. Input: threshold: float | None. Output: List[DuplicatePair].
  • get_obsolete(). Marks older versioned filenames as obsolete when a newer vN document exists. Input: None. Output: List[ObsoleteItem].
  • get_stale_embeddings(). Returns documents whose source_updated_at is more than 14 days newer than embedded_at. Input: None. Output: List[StaleItem].
  • get_suspects(). Flags documents with at least 5 retrievals and less than 15% engagement over the window. Input: window_days: int | None. Output: List[SuspectItem].
  • report(). Builds a summary plus top document rows so you can drive a dashboard directly. Input: window_days: int | None. Output: HealthReport.