Your vector database is growing.
Is it getting smarter?
Stop guessing at your corpus health. Corpulse plugs into the stack you already run, combining Postgres-backed analytics with Qdrant retrieval instrumentation to surface ghosts, duplicates, obsolete versions, stale embeddings, and low-engagement suspects.
$ pip install "corpulse[qdrant]"

Vector Database Entropy
As you add more data to your RAG application, retrieval quality doesn't just stay the same. It degrades silently. This is Vector Database Entropy.
- Ghost Documents. Forgotten documents that stay indexed but are no longer retrieved, inflating index size and cost without improving a single answer.
- Stale Chunks. Old versions of data that compete with fresh information, causing your agent to provide outdated or conflicting answers.
- Retrieval Drift. Subtle changes in your embedding space that slowly pull your retrieval accuracy down as the corpus grows.
Fitness tracker, not a grade
Static evaluation tools give you a grade on a specific test set. Corpulse is different.
Think of us as a fitness tracker for your corpus. We don't just tell you if you passed today's test; we provide continuous health signals from real-world usage. You keep your own vector database and application flow, while Corpulse adds observability around indexing, retrieval, engagement, freshness, and drift.
What it Measures
Corpulse turns registered documents, retrieval logs, and engagement events into health signals that impact precision, freshness, and cost.
Ghosts
Registered documents with no retrieval activity in the last 30 days, often pointing at dead or forgotten content.
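To make the ghost rule concrete, here is a minimal sketch of how such a check could work. It is an illustration, not Corpulse's internal implementation: `find_ghosts` and the `(doc_id, retrieved_at)` log shape are hypothetical, and only the 30-day window comes from the definition above.

```python
from datetime import datetime, timedelta, timezone

GHOST_WINDOW_DAYS = 30  # window taken from the definition above


def find_ghosts(registered_ids, retrievals, now=None):
    """Return registered doc IDs with no retrieval inside the window.

    retrievals: iterable of (doc_id, retrieved_at) tuples (hypothetical shape).
    """
    now = now or datetime.now(timezone.utc)
    cutoff = now - timedelta(days=GHOST_WINDOW_DAYS)
    recently_retrieved = {doc_id for doc_id, ts in retrievals if ts >= cutoff}
    return sorted(set(registered_ids) - recently_retrieved)


now = datetime(2025, 6, 30, tzinfo=timezone.utc)
logs = [
    ("doc-1", now - timedelta(days=2)),   # retrieved inside the window
    ("doc-2", now - timedelta(days=45)),  # last retrieved outside the window
]
ghosts = find_ghosts(["doc-1", "doc-2", "doc-3"], logs, now=now)
print(ghosts)  # ['doc-2', 'doc-3']
```

Note that a document which was never retrieved at all (`doc-3`) counts as a ghost alongside one whose last retrieval fell outside the window.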
Duplicates
Near-duplicate documents above the default 0.92 similarity threshold that waste context and confuse ranking.
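As a rough sketch of what a near-duplicate check looks like, the snippet below compares every pair of embeddings by cosine similarity and keeps pairs above 0.92. The helper names and the toy vectors are assumptions for illustration. A brute-force pairwise scan like this is only practical for small corpora, and says nothing about how Corpulse computes duplicates internally.

```python
import math
from itertools import combinations

DEFAULT_THRESHOLD = 0.92  # default threshold from the definition above


def cosine(a, b):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm = math.sqrt(sum(x * x for x in a)) * math.sqrt(sum(y * y for y in b))
    return dot / norm if norm else 0.0


def find_duplicate_pairs(embeddings, threshold=DEFAULT_THRESHOLD):
    """Return (id_a, id_b, score) for every pair scoring above the threshold."""
    pairs = []
    for (id_a, vec_a), (id_b, vec_b) in combinations(sorted(embeddings.items()), 2):
        score = cosine(vec_a, vec_b)
        if score > threshold:
            pairs.append((id_a, id_b, round(score, 3)))
    return pairs


# Toy 3-dimensional embeddings; real ones have hundreds of dimensions.
embeddings = {
    "faq-v1": [0.9, 0.1, 0.0],
    "faq-copy": [0.88, 0.12, 0.01],
    "pricing": [0.0, 0.2, 0.98],
}
duplicate_pairs = find_duplicate_pairs(embeddings)
print([(a, b) for a, b, _ in duplicate_pairs])  # [('faq-copy', 'faq-v1')]
```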
Obsolete
Older versioned files superseded by newer ones, such as v1 versus v2 of the same spec or policy.
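One way to picture the obsolete check is as filename grouping: strip a trailing `vN` marker, group files that share the same base name, and flag everything below the highest version. The sketch below assumes a hypothetical `<base>_v<N>.<ext>` naming convention (with `_`, `-`, or a space as separator); the pattern Corpulse actually matches may differ.

```python
import re
from collections import defaultdict

# Hypothetical convention: "<base>_v<N>.<ext>", "<base>-v<N>.<ext>", or "<base> v<N>.<ext>".
VERSION_RE = re.compile(
    r"^(?P<base>.+?)[-_ ]v(?P<version>\d+)(?P<ext>\.\w+)?$", re.IGNORECASE
)


def find_obsolete(filenames):
    """Return versioned filenames that have a newer vN sibling."""
    groups = defaultdict(list)
    for name in filenames:
        match = VERSION_RE.match(name)
        if match:  # unversioned files are never flagged
            key = (match["base"].lower(), (match["ext"] or "").lower())
            groups[key].append((int(match["version"]), name))

    obsolete = []
    for versions in groups.values():
        latest = max(version for version, _ in versions)
        obsolete.extend(name for version, name in versions if version < latest)
    return sorted(obsolete)


obsolete = find_obsolete(
    ["policy_v1.pdf", "policy_v2.pdf", "policy_v3.pdf", "spec-v1.md", "manual.pdf"]
)
print(obsolete)  # ['policy_v1.pdf', 'policy_v2.pdf']
```

Here `spec-v1.md` is left alone because no newer version of it exists, and `manual.pdf` is ignored because it carries no version marker.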
Stale
Documents whose source changed after embedding and are now more than 14 days behind the source of truth.
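The staleness rule boils down to a timestamp comparison. The sketch below uses the `source_updated_at` and `embedded_at` field names that appear in the method table; the dictionary record shape and the `find_stale` helper are assumptions made for illustration only.

```python
from datetime import datetime, timedelta, timezone

STALE_AFTER = timedelta(days=14)  # lag threshold from the definition above


def find_stale(docs):
    """Return IDs of docs whose source is more than 14 days newer than the embedding."""
    return [
        doc["doc_id"]
        for doc in docs
        if doc["source_updated_at"] - doc["embedded_at"] > STALE_AFTER
    ]


embedded = datetime(2025, 1, 1, tzinfo=timezone.utc)
docs = [
    # Source changed 19 days after embedding: stale.
    {"doc_id": "doc-a", "embedded_at": embedded,
     "source_updated_at": embedded + timedelta(days=19)},
    # Source changed 9 days after embedding: still within tolerance.
    {"doc_id": "doc-b", "embedded_at": embedded,
     "source_updated_at": embedded + timedelta(days=9)},
    # Source last changed before embedding: up to date.
    {"doc_id": "doc-c", "embedded_at": embedded,
     "source_updated_at": embedded - timedelta(days=3)},
]
stale = find_stale(docs)
print(stale)  # ['doc-a']
```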
Suspects
Documents retrieved at least 5 times in the last 30 days but engaged with less than 15% of the time.
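The suspect rule combines a volume floor with an engagement ceiling. In the sketch below, engagement rate is assumed to mean engagement events divided by retrievals, and both input lists are assumed to be already filtered to the 30-day window. `find_suspects` is a hypothetical helper, not a Corpulse function.

```python
from collections import Counter

MIN_RETRIEVALS = 5           # "at least 5 times"
MAX_ENGAGEMENT_RATE = 0.15   # "less than 15% of the time"


def find_suspects(retrieved_ids, engaged_ids):
    """Return (doc_id, retrievals, engagement_rate) for low-engagement documents.

    Both arguments are flat lists of doc IDs, one entry per event.
    """
    retrievals = Counter(retrieved_ids)
    engagements = Counter(engaged_ids)
    suspects = []
    for doc_id, count in retrievals.items():
        if count < MIN_RETRIEVALS:
            continue  # too little traffic to judge
        rate = engagements[doc_id] / count
        if rate < MAX_ENGAGEMENT_RATE:
            suspects.append((doc_id, count, round(rate, 2)))
    return sorted(suspects)


retrieved = ["doc-a"] * 10 + ["doc-b"] * 10 + ["doc-c"] * 3
engaged = ["doc-a"] * 1 + ["doc-b"] * 4
suspects = find_suspects(retrieved, engaged)
print(suspects)  # [('doc-a', 10, 0.1)]
```

`doc-b` escapes the flag with 40% engagement, and `doc-c` is skipped because three retrievals fall below the volume floor.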
Integrate without replacing your stack.
Use AsyncPostgresBackend for analytics storage, add the Qdrant wrapper for automatic retrieval logging, and keep explicit control over registration and engagement events.
import asyncio

from corpulse import AsyncCorpulse
from corpulse.backends import AsyncPostgresBackend


async def main():
    # Analytics storage lives in Postgres.
    backend = await AsyncPostgresBackend.create(
        "postgresql://user:pass@localhost/corpulse"
    )
    corpulse = AsyncCorpulse(backend=backend)

    # Register the document, then log retrieval and engagement explicitly.
    await corpulse.register_document("doc-123", "manual.pdf")
    await corpulse.log_retrieval(
        [{"doc_id": "doc-123", "filename": "manual.pdf", "score": 0.91}],
        query="How do I set this up?",
    )
    await corpulse.log_engagement("doc-123", event="opened")


asyncio.run(main())

Analysis Methods
Explicit APIs for registration, retrieval analytics, engagement tracking, and health reporting.
| Method | Description | Input | Output |
|---|---|---|---|
| get_ghosts() | Returns registered documents with no retrievals inside the 30-day ghost window. | None | List[GhostItem] |
| get_duplicates() | Finds near-duplicate document pairs by cosine similarity using a 0.92 default threshold. | threshold: float \| None | List[DuplicatePair] |
| get_obsolete() | Marks older versioned filenames as obsolete when a newer vN document exists. | None | List[ObsoleteItem] |
| get_stale_embeddings() | Returns documents whose source_updated_at is more than 14 days newer than embedded_at. | None | List[StaleItem] |
| get_suspects() | Flags documents with at least 5 retrievals and less than 15% engagement over the window. | window_days: int \| None | List[SuspectItem] |
| report() | Builds a summary plus top document rows so you can drive a dashboard directly. | window_days: int \| None | HealthReport |