Cross-Modal Retrieval Study

Can a sermon about forgiveness surface a relevant worship song? Testing semantic retrieval across content formats.

Research Question

Christian audio content exists in silos: sermons on YouTube, worship music on streaming platforms, podcasts in separate apps, and scripture in Bible apps. Can a single semantic embedding space bridge these formats, enabling cross-modal discovery?

Methodology

We encoded 14,729 items (993 sermons, 11,697 podcasts, 1,526 music albums, and 513 scripture passages) using BAAI/bge-small-en-v1.5 into a shared 384-dimensional embedding space. For each content type pair, we sampled 50 query items and measured cosine similarity against all items of the target type. Ground truth relevance was established via shared scripture references and thematic overlap.
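As a rough illustration of the scoring step, the sketch below averages the cosine similarity of each query's top-k targets, then averages over queries. It is an assumption about how such a metric could be computed, not the study's actual code. The function names and the toy 2-dimensional vectors are hypothetical stand-ins for real 384-dimensional embeddings, and a production pipeline would likely use a vectorized library instead of pure Python.

```python
import math
from typing import Sequence

Vector = Sequence[float]


def cosine(a: Vector, b: Vector) -> float:
    """Cosine similarity; returns 0.0 if either vector has zero norm."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    if norm_a == 0.0 or norm_b == 0.0:
        return 0.0
    return dot / (norm_a * norm_b)


def mean_topk_similarity(
    queries: Sequence[Vector], targets: Sequence[Vector], k: int = 10
) -> float:
    """Mean over queries of the average similarity of each query's k nearest targets."""
    if not queries or not targets:
        raise ValueError("queries and targets must be non-empty")
    per_query = []
    for q in queries:
        # Rank every target by similarity to this query and keep the best k.
        sims = sorted((cosine(q, t) for t in targets), reverse=True)[:k]
        per_query.append(sum(sims) / len(sims))
    return sum(per_query) / len(per_query)


# Toy vectors (hypothetical) standing in for sermon and music embeddings.
sermons = [[1.0, 0.0], [0.0, 1.0]]
music = [[1.0, 0.0], [1.0, 1.0], [-1.0, 0.0]]
score = mean_topk_similarity(sermons, music, k=2)
```

Swapping the `queries` and `targets` arguments gives the score for the opposite direction, which is why the two directions of a content type pair need not match.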

Key Results

The embedding space successfully bridges content formats without any explicit cross-modal training:

Sermon→Podcast: 0.830
Sermon→Music: 0.739
Sermon→Scripture: 0.797
Cross-modal pairs tested: 12

Cross-Modal Similarity Heatmap

Average cosine similarity of top-10 retrievals between content types. Higher values indicate stronger semantic bridging between formats.

Rows are the query type and columns are the target type. Same-type pairs are not reported.

Query \ Target   Sermon   Music   Podcast   Scripture
Sermon           -        0.739   0.830     0.797
Music            0.767    -       0.809     0.750
Podcast          0.798    0.724   -         0.755
Scripture        0.804    0.734   0.811     -

What this means

  • Sermons and podcasts are the most semantically similar pair (0.83). Both are spoken-word formats that discuss theology in depth, so the embedding model naturally maps them close together.
  • Scripture-to-podcast retrieval scores 0.81 — surprisingly high, showing the model can bridge a Bible passage to a podcast episode that expounds on the same theology.
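  • Music is the most distinct modality as a retrieval target (0.72–0.74 from the other types). As a query it scores 0.75–0.77 against sermons and scripture, with Music→Podcast (0.809) as the one exception. Music metadata tends to be sparser: album titles and artist names carry less theological signal than full sermon transcripts.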
  • The matrix shows moderate asymmetry between directions. For example, Scripture→Podcast (0.811) scores higher than Podcast→Scripture (0.755), a relative difference of about 7%, suggesting scripture passages retrieve podcast content more effectively than the reverse. The widest gap is Music→Podcast (0.809) versus Podcast→Music (0.724), about 12%. Both are areas for future investigation.
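The direction gaps can be recomputed directly from the heatmap values. The following is a small sketch; the `similarity` dictionary transcribes the reported numbers, while the helper name `direction_gaps` is hypothetical.

```python
# Mean top-10 cosine similarities from the heatmap, keyed as (query_type, target_type).
similarity = {
    ("Sermon", "Music"): 0.739, ("Sermon", "Podcast"): 0.830, ("Sermon", "Scripture"): 0.797,
    ("Music", "Sermon"): 0.767, ("Music", "Podcast"): 0.809, ("Music", "Scripture"): 0.750,
    ("Podcast", "Sermon"): 0.798, ("Podcast", "Music"): 0.724, ("Podcast", "Scripture"): 0.755,
    ("Scripture", "Sermon"): 0.804, ("Scripture", "Music"): 0.734, ("Scripture", "Podcast"): 0.811,
}


def direction_gaps(matrix):
    """Absolute difference between the two directions of each content type pair."""
    gaps = {}
    for (a, b), forward in matrix.items():
        if a < b:  # alphabetical ordering visits each unordered pair once
            gaps[(a, b)] = round(abs(forward - matrix[(b, a)]), 3)
    return gaps


gaps = direction_gaps(similarity)
largest_pair = max(gaps, key=gaps.get)
```

On these values the widest gap is between Music→Podcast and Podcast→Music (0.085), followed by Scripture→Podcast and Podcast→Scripture (0.056). The narrowest is between Sermon→Scripture and Scripture→Sermon (0.007).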

Cross-Modal Retrieval Examples

Real examples of successful cross-format discovery. A user exploring one piece of content could be connected to a semantically related item in a completely different format — without manual curation.

  • Sermon query: "How Do You Love Enemies?" (Bayside Church) → Podcast match: "Devotions from Psalms: Ep. 141 — Guard my heart from evil" (Thomas Skaria). Similarity: 90%
  • Podcast query: "Pizza Theology: Hearing From God" (Fellowship of Christian University Students) → Sermon match: "We've Been Invited to Play a Part in God's Story" (Newspring Church). Similarity: 90%
  • Scripture query: Revelation 22 → Podcast match: "Bible Prophecy Master Class, Part 1: Rev. 22:1-21" (Alvin Mitchell). Similarity: 90%
  • Sermon query: "Navigating The Bible: Romans" (Saddleback Church) → Scripture match: Romans 10. Similarity: 89%
  • Scripture query: Isaiah 33 → Sermon match: "Midweek Bible Study: Isaiah Lesson 10" (Glad Tidings Church Muncie). Similarity: 87%

Why This Matters for Rejoice

These results validate the core technical hypothesis behind Rejoice: a single embedding model can understand theological content well enough to recommend across format boundaries. A user listening to a sermon about grace can be shown a worship song about the same theme, a relevant scripture passage, and a deeper-dive podcast episode — all discovered through semantic similarity rather than manual curation.

This is fundamentally different from how most audio platforms work. Spotify, Apple Podcasts, and YouTube each recommend within their own silo. Rejoice bridges those silos by understanding what content means, not just what format it's in.

Limitations & Methodology Notes

No human ground truth validation: The showcase examples above were selected as the highest-scoring cross-modal pairs, but no human relevance judgments were collected. Whether a 90% cosine similarity actually corresponds to "relevant" content remains unvalidated. Future work should include expert theological annotations.

Sampling: Similarity scores were computed from 500 sampled query-target pairs per content type pair (not exhaustive). Results may not generalize to tail content.

Model limitations: BAAI/bge-small-en-v1.5 is a general-purpose sentence encoder (384 dimensions), not fine-tuned on theological text. Music metadata is sparser than sermon transcripts, which likely contributes to music being the lowest-similarity modality. A domain-adapted model could improve results, particularly for music and non-English content.

Asymmetry: Retrieval direction matters more than expected. The two directions of a pair differ by up to 0.085 in mean similarity (Music→Podcast at 0.809 versus Podcast→Music at 0.724, about 12%). This suggests the embedding space is not perfectly calibrated for bidirectional retrieval, and future work should investigate whether this reflects genuine content asymmetries or a model artifact.

No baselines: Results are not compared to simpler methods (BM25, TF-IDF, keyword matching). High cosine similarity scores look impressive but may not represent improvement over basic text matching for content with explicit scripture references.
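For orientation, a lexical baseline of the kind mentioned here could look roughly like the sketch below: TF-IDF vectors with a smoothed IDF, ranked by cosine similarity. This is an illustrative assumption, not something the study ran. The sample texts and all function names are hypothetical, and the query is included in the IDF corpus purely to keep the example short.

```python
import math
import re
from collections import Counter


def tokenize(text):
    """Lowercase and split into alphanumeric tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def tfidf_vectors(docs):
    """Sparse TF-IDF vectors (dict of term -> weight) using a smoothed IDF."""
    tokenized = [tokenize(d) for d in docs]
    n = len(tokenized)
    df = Counter(term for tokens in tokenized for term in set(tokens))
    idf = {term: math.log((1 + n) / (1 + count)) + 1.0 for term, count in df.items()}
    return [
        {term: tf * idf[term] for term, tf in Counter(tokens).items()}
        for tokens in tokenized
    ]


def sparse_cosine(a, b):
    """Cosine similarity between two sparse vectors; 0.0 if either is empty."""
    dot = sum(weight * b.get(term, 0.0) for term, weight in a.items())
    norm_a = math.sqrt(sum(w * w for w in a.values()))
    norm_b = math.sqrt(sum(w * w for w in b.values()))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


# Hypothetical texts: the first is the query, the rest are candidates.
docs = [
    "Romans 8 and the hope of glory",
    "A sermon series on Romans 8",
    "Worship songs for Sunday morning",
]
vectors = tfidf_vectors(docs)
query, candidates = vectors[0], vectors[1:]
scores = [sparse_cosine(query, c) for c in candidates]
best = max(range(len(scores)), key=scores.__getitem__)  # index of the top candidate
```

A candidate that repeats the query's explicit reference ("Romans 8") scores above zero, while one with no shared terms scores exactly zero. That is the behavior an embedding model would need to beat on content with explicit scripture references.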

Can a bare scripture reference find everything that cites it?

A precision-at-K benchmark over 4,133 query–item pairs, scored on GPU in under a second. Not a marketing metric — a floor.

What You're Seeing

Measuring retrieval the way a search engine would

For every item in the catalog with explicit scripture_refs, we take each reference ("Romans 8:28", "Psalms 23", etc.), encode it with the same BAAI/bge-small-en-v1.5 model used by site search, and rank all 14,729 catalog embeddings by cosine similarity. A retrieved item counts as relevant if it cites the same chapter-level reference as the query — giving us a clean ground truth for Precision@1, @5, and @10.

This is strict. No human judgement, no generous interpretation — only: does the top-K contain items the catalog itself says are about the same passage? A 45% P@1 means: in nearly half of all 4,133 tests, the very top result was another item citing that exact chapter, out of a pool of 14,729.
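The relevance rule and the metric can be sketched as follows. This is a minimal illustration under stated assumptions, not the benchmark's actual code: the catalog entries, item IDs, and helper names are hypothetical, the ranking is supplied by hand instead of coming from an embedding search, and precision is divided by k even when fewer than k items are returned.

```python
from typing import Dict, Sequence, Set


def chapter_of(reference: str) -> str:
    """Reduce 'Romans 8:28' to the chapter-level key 'Romans 8'."""
    return reference.split(":", 1)[0].strip()


def relevant_items(query_ref: str, refs_by_item: Dict[str, Set[str]]) -> Set[str]:
    """Items that cite the same chapter as the query reference."""
    chapter = chapter_of(query_ref)
    return {
        item_id
        for item_id, refs in refs_by_item.items()
        if chapter in {chapter_of(r) for r in refs}
    }


def precision_at_k(ranked_ids: Sequence[str], relevant: Set[str], k: int) -> float:
    """Fraction of the top-k ranked items that are relevant."""
    if k <= 0:
        raise ValueError("k must be positive")
    hits = sum(1 for item_id in ranked_ids[:k] if item_id in relevant)
    return hits / k


# Hypothetical catalog metadata: item ID -> scripture references it cites.
catalog_refs = {
    "sermon-1": {"Romans 8"},
    "podcast-7": {"Romans 8:28", "Psalms 23"},
    "album-3": {"Psalms 23"},
    "sermon-9": set(),
}

# Hypothetical ranking by cosine similarity for the query "Romans 8:28".
ranked = ["podcast-7", "album-3", "sermon-1", "sermon-9"]
relevant = relevant_items("Romans 8:28", catalog_refs)
p_at_1 = precision_at_k(ranked, relevant, 1)
p_at_2 = precision_at_k(ranked, relevant, 2)
```

Here the top result cites Romans 8, so P@1 is 1.0. The second result does not, so P@2 drops to 0.5. Averaging such per-query values over every (item, reference) pair yields the aggregate figures.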

Key Findings
  • Overall P@1 = 44.7%, P@5 = 26.2%, P@10 = 19.7% across 4,133 (item, reference) pairs drawn from 3,450 items — a respectable baseline for a general-purpose encoder that was never fine-tuned on scripture references
  • Pauline epistles dominate the leaderboard: Philippians (P@1 75%, P@5 66%), Ephesians (P@1 74%, P@5 52%), Romans (P@1 70%, P@5 51%). Letters with a tight thematic core cluster remarkably well in embedding space
  • Popularity matters. For obscure passages (cited by only 2–3 items) P@5 collapses to 8.2%; for very popular ones (27+ items) it climbs to 39.3% — a 4.8× spread that quantifies the long-tail problem every retrieval system faces
  • Querying from a scripture passage itself is the worst source type (P@1 19.4%). Sermons and podcasts, whose descriptions mix the reference with surrounding theology, query 2.3× more effectively — a clear signal that bare chapter strings carry little retrieval signal on their own