Scripture Retrieval Benchmark

Methodology

Data. We iterate over every item in the 14,729-item catalog whose scripture_refs field is non-empty. For each such item, every reference it cites becomes a separate query–item pair. References are filtered to drop tokens that lack a verse or are clearly shaped like timestamps, and are normalized to chapter level (Romans 8:28 → Romans 8) for ground-truth matching.
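The pair-building step can be sketched roughly as follows. This is a minimal illustration, not the benchmark's actual code: the regular expression, the function names, and the assumption that each catalog item is a dict with "id" and "scripture_refs" keys are all hypothetical.

```python
import re

# Hypothetical pattern: optional leading digit ("1 John"), book name,
# chapter, verse, and an optional verse range ("Romans 8:28-30").
REF_PATTERN = re.compile(
    r"^\s*((?:[1-3]\s)?[A-Za-z]+(?:\s[A-Za-z]+)*)\s+(\d+):(\d+)(?:\s*-\s*\d+)?\s*$"
)

def normalize_to_chapter(ref):
    """Return 'Book Chapter' for a 'Book Chapter:Verse' reference, else None."""
    match = REF_PATTERN.match(ref)
    if match is None:
        # Verse-less references ("Romans 8") and timestamp-shaped tokens
        # ("12:34") do not match the pattern and are dropped.
        return None
    book, chapter, _verse = match.groups()
    return f"{book} {int(chapter)}"

def build_pairs(catalog):
    """Return one (query string, chapter key, source item id) tuple per cited reference."""
    pairs = []
    for item in catalog:
        for ref in item.get("scripture_refs") or []:
            chapter = normalize_to_chapter(ref)
            if chapter is not None:
                pairs.append((ref.strip(), chapter, item["id"]))
    return pairs
```

The query keeps its original verse-level string, while the chapter key is used only to decide which items count as relevant.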

Query. Each distinct reference string (e.g. "Romans 8:28") is encoded once using BAAI/bge-small-en-v1.5 (the same model used throughout the hub) with the retrieval prefix "Represent this sentence for searching relevant passages: ". All unique queries are batched and encoded on the GPU.
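A small sketch of the "encode once per distinct reference" step is shown below. The helper name and the injected encode_batch callable are assumptions made for illustration. The commented lines show one way it might be wired to the sentence-transformers package, and the batch size there is arbitrary.

```python
QUERY_PREFIX = "Represent this sentence for searching relevant passages: "

def encode_unique_queries(references, encode_batch):
    """Encode each distinct reference string once; return {reference: embedding}."""
    # Deduplicate while keeping first-seen order.
    unique_refs = list(dict.fromkeys(references))
    prefixed = [QUERY_PREFIX + ref for ref in unique_refs]
    # One batched call covers all unique queries.
    embeddings = encode_batch(prefixed)
    return dict(zip(unique_refs, embeddings))

# With sentence-transformers installed, encode_batch could be (assumed setup):
#   from sentence_transformers import SentenceTransformer
#   model = SentenceTransformer("BAAI/bge-small-en-v1.5", device="cuda")
#   encode_batch = lambda texts: model.encode(
#       texts, batch_size=256, normalize_embeddings=True)
```

Passing the encoder in as a callable keeps the deduplication and prefixing logic testable without loading the model.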

Ranking. We score each query against all 14,729 catalog embeddings via cosine similarity, mask out the source item, and take the top-10.
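The ranking step amounts to a masked top-k over cosine similarities. The pure-Python sketch below is only meant to show the logic. The function names and the dict-of-vectors layout are hypothetical, and at catalog scale a single matrix product over pre-normalized embeddings would be the more practical choice.

```python
import heapq
import math

def cosine(u, v):
    """Cosine similarity of two equal-length vectors (0.0 if either is all zeros)."""
    dot = sum(a * b for a, b in zip(u, v))
    norm = math.sqrt(sum(a * a for a in u)) * math.sqrt(sum(b * b for b in v))
    return dot / norm if norm else 0.0

def top_k(query_vec, catalog_vecs, source_id, k=10):
    """Return ids of the k most similar catalog items, excluding the source item."""
    scored = (
        (cosine(query_vec, vec), item_id)
        for item_id, vec in catalog_vecs.items()
        if item_id != source_id  # mask out the item the query was drawn from
    )
    best = heapq.nlargest(k, scored, key=lambda pair: pair[0])
    return [item_id for _score, item_id in best]
```

Masking the source item matters because it would otherwise tend to rank near the top for its own reference and count as a trivial hit.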

Ground truth. An item is considered relevant to a query if it cites the same normalized chapter-level reference as the query. P@k is computed as the fraction of the top-k that is relevant; the final metrics are the mean across all evaluated pairs.
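As a concrete illustration of the metric, the sketch below computes P@k for one ranked list and averages it over pairs. The helper names and the chapter_to_items mapping (chapter key to the set of item ids citing it) are assumptions introduced for this example.

```python
def relevant_items(chapter, source_id, chapter_to_items):
    """Items citing the same chapter as the query, excluding the source item."""
    return chapter_to_items.get(chapter, set()) - {source_id}

def precision_at_k(ranked_ids, relevant_ids, k=10):
    """Fraction of the top-k ranked items that are relevant."""
    hits = sum(1 for item_id in ranked_ids[:k] if item_id in relevant_ids)
    return hits / k

def mean_precision_at_k(evaluated, k=10):
    """evaluated: iterable of (ranked_ids, relevant_ids), one entry per query-item pair."""
    scores = [precision_at_k(ranked, relevant, k) for ranked, relevant in evaluated]
    return sum(scores) / len(scores) if scores else 0.0
```

For example, a query whose top-10 contains three items citing the same chapter scores P@10 = 0.3, and the reported figure is the mean of such scores over all evaluated pairs.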

Exclusions. Query–item pairs with no other item citing the same chapter (i.e., no possible relevant result) are excluded, since P@k is necessarily zero for them and says nothing about retrieval quality. Best/worst lists additionally require ≥3 pairs per scripture to avoid single-pair noise.
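Both exclusion rules can be expressed as simple filters. The sketch below assumes each pair is a (query string, chapter key, source item id) tuple and that best/worst lists are built from per-key mean scores. The function names and the choice of grouping key are hypothetical.

```python
from collections import defaultdict

def drop_unanswerable(pairs):
    """Keep only pairs whose chapter is cited by at least one item other than the source."""
    chapter_to_items = defaultdict(set)
    for _query, chapter, item_id in pairs:
        chapter_to_items[chapter].add(item_id)
    return [
        pair for pair in pairs
        if chapter_to_items[pair[1]] - {pair[2]}  # non-empty set of other citing items
    ]

def best_worst_candidates(scored_pairs, min_pairs=3):
    """scored_pairs: iterable of (key, score). Return {key: mean score} for keys with enough pairs."""
    by_key = defaultdict(list)
    for key, score in scored_pairs:
        by_key[key].append(score)
    return {
        key: sum(scores) / len(scores)
        for key, scores in by_key.items()
        if len(scores) >= min_pairs
    }
```

Sorting the returned dictionary by value in either direction would then yield the best and worst lists.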

Known limitations. Chapter-level matching is more lenient than verse-level matching and therefore inflates the reported precision (Romans 8:28 and Romans 8:1 count as the same target). BAAI/bge-small-en-v1.5 was not fine-tuned on theological text, so a bare scripture reference carries relatively little semantic signal and competes against many items whose full descriptions are richer. Music items in the catalog rarely carry explicit scripture_refs, so music is effectively unevaluated here.

Per-book accuracy heatmap

Per content-type breakdown

Per popularity bucket

Top 10 best-retrieving scriptures

Bottom 10 worst-retrieving scriptures
