Cross-Modal Retrieval Study
Can a sermon about forgiveness surface a relevant worship song? Testing semantic retrieval across content formats.
Research Question
Christian audio content exists in silos: sermons on YouTube, worship music on streaming platforms, podcasts in separate apps, and scripture in Bible apps. Can a single semantic embedding space bridge these formats, enabling cross-modal discovery?
Methodology
We encoded 14,729 items (993 sermons, 11,697 podcasts, 1,526 music albums, and 513 scripture passages) using BAAI/bge-small-en-v1.5 into a shared 384-dimensional embedding space. For each content type pair, we sampled 50 query items and measured cosine similarity against all items of the target type. Ground truth relevance was established via shared scripture references and thematic overlap.
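The scoring step described above can be sketched as follows. This is a minimal illustration, not the study's actual pipeline: it assumes embeddings are already available as plain lists of floats, uses toy 2-dimensional vectors in place of the 384-dimensional model output, and the function names (`cosine`, `mean_top_k_similarity`) are hypothetical.

```python
import math

def cosine(a, b):
    """Cosine similarity between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def mean_top_k_similarity(queries, targets, k=10):
    """For each query, average its k highest similarities against all
    targets, then average those per-query means."""
    per_query = []
    for q in queries:
        sims = sorted((cosine(q, t) for t in targets), reverse=True)
        top = sims[:k]
        per_query.append(sum(top) / len(top))
    return sum(per_query) / len(per_query)

# Toy stand-ins for embeddings (assumption: real vectors have 384 dims).
sermons = [[1.0, 0.0], [0.8, 0.6]]
songs = [[1.0, 0.0], [0.0, 1.0], [0.6, 0.8]]

score = mean_top_k_similarity(sermons, songs, k=2)
print(round(score, 2))
```

Because the average is taken over each query's own top-k list, swapping the roles of `queries` and `targets` generally gives a different number, which is one way a direction-dependent score can arise.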
Key Results
The embedding space successfully bridges content formats without any explicit cross-modal training:
Cross-Modal Similarity Heatmap
Average cosine similarity of top-10 retrievals between content types. Higher values indicate stronger semantic bridging between formats.
What this means
- Sermons and podcasts are the most semantically similar pair (0.83). Both are spoken-word formats that discuss theology in depth, so the embedding model naturally maps them close together.
- Scripture-to-podcast retrieval scores 0.81 — surprisingly high, showing the model can bridge a Bible passage to a podcast episode that expounds on the same theology.
- Music is the most distinct modality (0.72–0.77 against other types). Music metadata tends to be sparser — album titles and artist names carry less theological signal than full sermon transcripts.
- The matrix shows moderate asymmetry (up to 7% difference between directions). For example, Scripture→Podcast (0.811) scores higher than Podcast→Scripture (0.755), an absolute gap of 0.056, or roughly 7% in relative terms. This suggests scripture passages retrieve podcast content more effectively than the reverse, which is an area for future investigation.
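The asymmetry figure in the last bullet can be checked with a few lines of arithmetic. The text does not say which direction is used as the denominator for the percentage, so this sketch computes both; the variable names are illustrative only.

```python
# Directional scores quoted in the text.
scripture_to_podcast = 0.811
podcast_to_scripture = 0.755

abs_gap = scripture_to_podcast - podcast_to_scripture

# Relative gap against each direction (assumption: the source does not
# state which denominator its percentage uses).
rel_to_lower = abs_gap / podcast_to_scripture
rel_to_higher = abs_gap / scripture_to_podcast

print(f"absolute gap: {abs_gap:.3f}")
print(f"relative to lower score:  {rel_to_lower:.1%}")
print(f"relative to higher score: {rel_to_higher:.1%}")
```

Either denominator rounds to about 7%, so the quoted figure is consistent with a relative difference rather than a difference of 7 similarity points.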
Cross-Modal Retrieval Examples
Real examples of successful cross-format discovery. A user exploring one piece of content could be connected to a semantically related item in a completely different format — without manual curation.
Why This Matters for Rejoice
These results support the core technical hypothesis behind Rejoice: a single embedding model can understand theological content well enough to recommend across format boundaries. A user listening to a sermon about grace can be shown a worship song about the same theme, a relevant scripture passage, and a deeper-dive podcast episode, all discovered through semantic similarity rather than manual curation.
This is fundamentally different from how most audio platforms work. Spotify, Apple Podcasts, and YouTube each recommend within their own silo. Rejoice bridges those silos by understanding what content means, not just what format it's in.
Limitations & Methodology Notes
No ground truth validation: The showcase examples above were selected as the highest-scoring cross-modal pairs, but no human relevance judgments were collected. Whether a cosine similarity of 0.90 actually corresponds to "relevant" content remains unvalidated. Future work should include expert theological annotations.
Sampling: Similarity scores were computed from 500 sampled query-target pairs per content type pair (not exhaustive). Results may not generalize to tail content.
Model limitations: BAAI/bge-small-en-v1.5 is a general-purpose sentence encoder (384 dimensions), not fine-tuned on theological text. Music metadata is sparser than sermon transcripts, which likely contributes to music being the lowest-similarity modality. A domain-adapted model could improve results, particularly for music and non-English content.
Asymmetry: Retrieval direction matters more than expected, with up to a 7% difference between query→target and target→query scores for some pairs. This suggests the embedding space is not perfectly calibrated for bidirectional retrieval, and future work should investigate whether this reflects genuine content asymmetries or a model artifact.
No baselines: Results are not compared to simpler methods (BM25, TF-IDF, keyword matching). High cosine similarity scores look impressive but may not represent improvement over basic text matching for content with explicit scripture references.
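As an illustration of the kind of baseline this note refers to, the sketch below ranks documents against a query using TF-IDF weights and cosine similarity, in pure Python. It is not part of the study: the tokenizer, the smoothed IDF formula, the helper names, and the three example documents are all assumptions chosen to keep the example small.

```python
import math
from collections import Counter

def tokenize(text):
    """Lowercase, split on whitespace, and break 'chapter:verse' apart."""
    return text.lower().replace(":", " ").split()

def tfidf_vectors(docs):
    """Return one sparse TF-IDF vector (dict) per document, plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for toks in tokenized for term in set(toks))
    # Smoothed IDF so terms present in every document keep a non-zero weight.
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    vectors = [{t: f * idf[t] for t, f in Counter(toks).items()}
               for toks in tokenized]
    return vectors, idf

def sparse_cosine(a, b):
    dot = sum(w * b.get(t, 0.0) for t, w in a.items())
    na = math.sqrt(sum(w * w for w in a.values()))
    nb = math.sqrt(sum(w * w for w in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def rank(query, docs):
    """Indices of docs, best match first (ties broken by index)."""
    vectors, idf = tfidf_vectors(docs)
    q_vec = {t: f * idf[t] for t, f in Counter(tokenize(query)).items()
             if t in idf}
    scored = [(sparse_cosine(q_vec, v), i) for i, v in enumerate(vectors)]
    return [i for _, i in sorted(scored, key=lambda p: (-p[0], p[1]))]

# Hypothetical catalog descriptions, for illustration only.
docs = [
    "Sermon on Romans 8 and hope in suffering",
    "Worship album: songs of praise",
    "Podcast episode discussing Romans 8:28",
]
ranking = rank("Romans 8:28", docs)
print(ranking)
```

Running the same relevance metric over a ranking like this and over the embedding ranking would show how much the dense model adds beyond literal term overlap.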
Can a bare scripture reference find everything that cites it?
A precision-at-K benchmark over 4,133 query–item pairs, scored on GPU in under a second. Not a marketing metric — a floor.
Measuring retrieval the way a search engine would
For every item in the catalog with explicit scripture_refs, we take each reference ("Romans 8:28", "Psalms 23", etc.), encode it with the same BAAI/bge-small-en-v1.5 model used by site search, and rank all 14,729 catalog embeddings by cosine similarity. A retrieved item counts as relevant if it cites the same chapter-level reference as the query — giving us a clean ground truth for Precision@1, @5, and @10.
This is strict. No human judgement, no generous interpretation, only one question: does the top-K contain items the catalog itself says are about the same passage? A P@1 of about 45% means that in roughly 45 of every 100 tests (out of 4,133), the very top result was another item citing that same chapter, out of a pool of 14,729.
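The relevance rule and metric described above can be sketched as follows. This assumes chapter-level matching means dropping everything after the colon in a reference; the helper names, the toy catalog, and the ranking are hypothetical, and the source does not say whether the item a query was drawn from is excluded from its own results.

```python
def chapter_key(ref):
    """Reduce a reference to chapter level: 'Romans 8:28' -> 'Romans 8'."""
    return ref.split(":", 1)[0].strip()

def precision_at_k(ranked_ids, relevant_ids, k):
    """Fraction of the top-k ranked items that are relevant."""
    top = ranked_ids[:k]
    hits = sum(1 for item in top if item in relevant_ids)
    return hits / k

# Hypothetical catalog: item id -> explicit scripture references.
catalog_refs = {
    "sermon-1": ["Romans 8:28"],
    "podcast-7": ["Romans 8:1", "Psalms 23"],
    "album-3": ["Psalms 23"],
    "sermon-9": ["John 3:16"],
}

query = "Romans 8:28"
relevant = {
    item for item, refs in catalog_refs.items()
    if chapter_key(query) in {chapter_key(r) for r in refs}
}

# Hypothetical ranking, standing in for the cosine-similarity ordering.
ranked = ["sermon-1", "album-3", "podcast-7", "sermon-9"]

p_at_1 = precision_at_k(ranked, relevant, 1)
p_at_2 = precision_at_k(ranked, relevant, 2)
print(sorted(relevant), p_at_1, p_at_2)
```

Averaging `precision_at_k` over every (item, reference) pair gives an overall P@K figure of the kind reported here.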
- Overall P@1 = 44.7%, P@5 = 26.2%, P@10 = 19.7% across 4,133 (item, reference) pairs drawn from 3,450 items — a respectable baseline for a general-purpose encoder that was never fine-tuned on scripture references
- Pauline epistles dominate the leaderboard: Philippians (P@1 75%, P@5 66%), Ephesians (P@1 74%, P@5 52%), Romans (P@1 70%, P@5 51%). Short letters with a tight thematic core cluster remarkably well in embedding space
- Popularity matters. For obscure passages (cited by only 2–3 items) P@5 collapses to 8.2%; for very popular ones (27+ items) it climbs to 39.3% — a 4.8× spread that quantifies the long-tail problem every retrieval system faces
- Queries whose source item is a scripture passage perform worst (P@1 19.4%). Sermons and podcasts, whose descriptions mix the reference with surrounding theology, query 2.3× more effectively, which suggests that bare chapter strings carry little retrieval signal on their own.
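One point worth keeping in mind when reading the popularity bullet above: P@K has a hard ceiling when fewer than K items cite a passage. The sketch below works through that ceiling and the quoted spread. It assumes every citing item counts as relevant; if the query's own source item were excluded, the ceilings for rare passages would be lower still.

```python
def max_precision_at_k(num_relevant, k):
    """Best achievable P@K when only num_relevant items can ever match."""
    return min(num_relevant, k) / k

# Ceiling on P@5 for passages cited by 2, 3, and 27 items.
ceilings = {n: max_precision_at_k(n, 5) for n in (2, 3, 27)}
print(ceilings)

# Spread between the popular and obscure buckets quoted in the text.
spread = 39.3 / 8.2
print(round(spread, 1))
```

So a passage cited by only two or three items cannot exceed 40% to 60% P@5 even under a perfect ranking, which means part of the gap between buckets reflects the metric itself rather than retrieval quality.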