Project 03

Cross-Modal Retrieval Study

Can a sermon about forgiveness surface a relevant worship song? Testing semantic retrieval across content formats.

Research Question

Christian audio content exists in silos: sermons on YouTube, worship music on streaming platforms, podcasts in separate apps, and scripture in Bible apps. Can a single semantic embedding space bridge these formats, enabling cross-modal discovery?

Methodology

We encoded 14,729 items (993 sermons, 11,697 podcasts, 1,526 music albums, and 513 scripture passages) using BAAI/bge-small-en-v1.5 into a shared 384-dimensional embedding space. For each ordered pair of content types (12 in all), we sampled 50 query items, computed cosine similarity against all items of the target type, and averaged each query's top-10 scores. Ground truth relevance was established via shared scripture references and thematic overlap.
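The per-pair score can be sketched as follows. This is a minimal illustration under stated assumptions, not the project's code: it assumes embeddings are already computed (one way would be sentence-transformers' `SentenceTransformer("BAAI/bge-small-en-v1.5").encode(texts, normalize_embeddings=True)`), and the helper name `mean_topk_similarity`, its fixed random seed, and averaging the top-10 scores (taken from the heatmap caption) are assumptions.

```python
import numpy as np


def mean_topk_similarity(query_emb: np.ndarray,
                         target_emb: np.ndarray,
                         k: int = 10,
                         n_queries: int = 50,
                         seed: int = 0) -> float:
    """Average of each sampled query's k highest cosine similarities.

    Hypothetical helper: samples up to `n_queries` rows of `query_emb`
    without replacement, scores them against every row of `target_emb`,
    keeps the k highest cosine similarities per query, and averages them.
    """
    rng = np.random.default_rng(seed)
    n = min(n_queries, len(query_emb))
    idx = rng.choice(len(query_emb), size=n, replace=False)

    # L2-normalise rows so that a dot product equals cosine similarity.
    q = query_emb[idx]
    q = q / np.linalg.norm(q, axis=1, keepdims=True)
    t = target_emb / np.linalg.norm(target_emb, axis=1, keepdims=True)

    sims = q @ t.T                      # shape: (n, number of targets)
    k = min(k, sims.shape[1])
    # np.partition moves the k largest values (in no particular order)
    # into the last k columns of each row.
    topk = np.partition(sims, -k, axis=1)[:, -k:]
    return float(topk.mean())


if __name__ == "__main__":
    # Random stand-ins shaped like the catalog (993 sermons, 11,697 podcasts,
    # 384 dimensions); real inputs would be model embeddings.
    rng = np.random.default_rng(1)
    sermons = rng.normal(size=(993, 384))
    podcasts = rng.normal(size=(11_697, 384))
    print(f"Sermon->Podcast: {mean_topk_similarity(sermons, podcasts):.3f}")
```

Because the score averages each query's best matches, it measures how close the nearest neighbours are, not whether those neighbours are relevant.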

Key Results

The embedding space successfully bridges content formats without any explicit cross-modal training:

Sermon→Podcast: 0.830
Sermon→Music: 0.739
Sermon→Scripture: 0.797

Cross-Modal Similarity Heatmap

Average cosine similarity of the top-10 retrievals between content types (rows: query type; columns: target type). Higher values indicate stronger semantic bridging between formats.

Query \ Target  Sermon  Music  Podcast  Scripture
Sermon          n/a     0.739  0.830    0.797
Music           0.767   n/a    0.809    0.750
Podcast         0.798   0.724  n/a      0.755
Scripture       0.804   0.734  0.811    n/a

What this means

Sermons and podcasts are the most semantically similar pair: Sermon→Podcast (0.830) is the highest cell in the matrix, and Podcast→Sermon scores 0.798. Both are spoken-word formats that discuss theology in depth, so the embedding model plausibly maps them close together.
Scripture→Podcast retrieval scores 0.811, the second-highest cell, suggesting the model can connect a Bible passage to a podcast episode that expounds on the same theology.
Music is the hardest target: the three retrievals into music (0.724–0.739) are the three lowest cells in the matrix, while music as a query scores 0.750–0.809. Music metadata tends to be sparser; album titles and artist names carry less theological signal than full sermon transcripts.
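The matrix is asymmetric, with gaps of up to 0.085 between directions: Music→Podcast scores 0.809 but Podcast→Music only 0.724, and Scripture→Podcast (0.811) exceeds Podcast→Scripture (0.755). Because cosine similarity itself is symmetric, these gaps come from the different query samples and target pools; notably, every direction into Podcast, by far the largest pool (11,697 items), scores above its reverse, which is consistent with a larger pool offering more high-scoring candidates. Separating this pool-size effect from genuine content differences is an area for future investigation.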
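The direction gaps can be checked against the heatmap values with a few lines of code. In this small sketch the matrix is transcribed from the heatmap (rows are query types, columns target types); the names `SIM` and `direction_gaps` are illustrative.

```python
from itertools import combinations

TYPES = ["Sermon", "Music", "Podcast", "Scripture"]

# Mean top-10 cosine similarity from the heatmap: SIM[query][target].
SIM = {
    "Sermon":    {"Music": 0.739, "Podcast": 0.830, "Scripture": 0.797},
    "Music":     {"Sermon": 0.767, "Podcast": 0.809, "Scripture": 0.750},
    "Podcast":   {"Sermon": 0.798, "Music": 0.724, "Scripture": 0.755},
    "Scripture": {"Sermon": 0.804, "Music": 0.734, "Podcast": 0.811},
}


def direction_gaps(sim):
    """Return (a, b, sim[a][b], sim[b][a], absolute gap) per unordered pair, largest gap first."""
    rows = []
    for a, b in combinations(TYPES, 2):
        ab, ba = sim[a][b], sim[b][a]
        rows.append((a, b, ab, ba, round(abs(ab - ba), 3)))
    return sorted(rows, key=lambda r: r[-1], reverse=True)


if __name__ == "__main__":
    for a, b, ab, ba, gap in direction_gaps(SIM):
        print(f"{a}->{b} {ab:.3f} vs {b}->{a} {ba:.3f}: gap {gap:.3f}")
```

With these values the largest gap is Music/Podcast (0.085), followed by Podcast/Scripture (0.056), and every direction into Podcast scores above its reverse.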

Cross-Modal Retrieval Examples

Real examples of high-scoring cross-format matches. A user exploring one piece of content could be connected to a semantically related item in a completely different format, without manual curation.

Sermon → Podcast: "How Do You Love Enemies?" (Bayside Church) matched "Devotions from Psalms: Ep. 141 — Guard my heart from evil" (Thomas Skaria), cosine similarity 0.90
Podcast → Sermon: "Pizza Theology: Hearing From God" (Fellowship of Christian University Students) matched "We've Been Invited to Play a Part in God's Story" (Newspring Church), cosine similarity 0.90
Scripture → Podcast: Revelation 22 matched "Bible Prophecy Master Class, Part 1: Rev. 22:1-21" (Alvin Mitchell), cosine similarity 0.90
Sermon → Scripture: "Navigating The Bible: Romans" (Saddleback Church) matched Romans 10, cosine similarity 0.89
Scripture → Sermon: Isaiah 33 matched "Midweek Bible Study: Isaiah Lesson 10" (Glad Tidings Church Muncie), cosine similarity 0.87

Why This Matters for Rejoice

These results support the core technical hypothesis behind Rejoice: a single general-purpose embedding model captures theological content well enough to recommend across format boundaries. A user listening to a sermon about grace could be shown a worship album on the same theme, a relevant scripture passage, and a deeper-dive podcast episode, all discovered through semantic similarity rather than manual curation.

This is fundamentally different from how most audio platforms work. Spotify, Apple Podcasts, and YouTube each recommend within their own silo. Rejoice bridges those silos by understanding what content means, not just what format it's in.

Limitations & Methodology Notes

No ground truth validation: The showcase examples above were selected from the highest-scoring cross-modal pairs, but no human relevance judgments were collected. Whether a cosine similarity of 0.90 actually corresponds to "relevant" content remains unvalidated. Future work should include expert theological annotations.

Sampling: Each cell of the similarity matrix is based on 50 sampled queries and their top-10 retrievals (500 query-target pairs per content type pair), not an exhaustive comparison. Results may not generalize to tail content.
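One way to make this caveat concrete is a percentile bootstrap over the 50 per-query scores behind each cell, which gives a rough interval for the cell's mean. This is a suggested analysis, not one the study reports; `bootstrap_ci`, its defaults, and the placeholder scores in the usage lines are hypothetical.

```python
import numpy as np


def bootstrap_ci(per_query_scores, n_boot: int = 2000, alpha: float = 0.05, seed: int = 0):
    """Return (mean, lower, upper): a percentile bootstrap interval for the mean score."""
    scores = np.asarray(per_query_scores, dtype=float)
    rng = np.random.default_rng(seed)
    # Resample queries with replacement; each row is one bootstrap replicate.
    idx = rng.integers(0, len(scores), size=(n_boot, len(scores)))
    means = scores[idx].mean(axis=1)
    lo, hi = np.quantile(means, [alpha / 2, 1 - alpha / 2])
    return float(scores.mean()), float(lo), float(hi)


if __name__ == "__main__":
    # Placeholder per-query top-10 means; real values would come from the study.
    fake_scores = np.random.default_rng(7).uniform(0.70, 0.90, size=50)
    mean, lo, hi = bootstrap_ci(fake_scores)
    print(f"mean {mean:.3f}, 95% interval [{lo:.3f}, {hi:.3f}]")
```

Resampling queries rather than individual query-target pairs respects the fact that the ten retrievals of one query are not independent.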

Model limitations: BAAI/bge-small-en-v1.5 is a general-purpose sentence encoder (384 dimensions), not fine-tuned on theological text. Music metadata is sparser than sermon transcripts, which likely contributes to music being the lowest-similarity modality. A domain-adapted model could improve results, particularly for music and non-English content.

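Asymmetry: Retrieval direction matters more than expected, with gaps of up to 0.085 between query→target and target→query scores (Music→Podcast 0.809 vs. Podcast→Music 0.724). Cosine similarity itself is symmetric, so these gaps must arise from how the metric is aggregated (different sampled queries, target pools of very different sizes, top-10 averaging) rather than from any directional bias in the similarity function. Future work should investigate whether they reflect genuine content asymmetries or this measurement artifact.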
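A toy simulation shows how pool size alone can produce asymmetric top-10 averages. It is offered as an illustration, not evidence about the real catalog: it uses random unit vectors in 64 dimensions instead of real embeddings, and the pool sizes merely echo the catalog's Scripture (513) and Podcast (11,697) counts. Even though the similarity function is symmetric, the direction that searches the larger pool gets the higher average.

```python
import numpy as np


def unit_rows(n: int, dim: int, rng: np.random.Generator) -> np.ndarray:
    """Random unit vectors: a stand-in for real embeddings (assumption)."""
    x = rng.normal(size=(n, dim))
    return x / np.linalg.norm(x, axis=1, keepdims=True)


def mean_top_k(queries: np.ndarray, targets: np.ndarray, k: int = 10) -> float:
    """Mean of each query's k highest cosine similarities (rows are unit vectors)."""
    sims = queries @ targets.T
    return float(np.partition(sims, -k, axis=1)[:, -k:].mean())


rng = np.random.default_rng(0)
dim = 64
small_pool = unit_rows(513, dim, rng)      # size echoes the Scripture count
large_pool = unit_rows(11_697, dim, rng)   # size echoes the Podcast count

# 50 sampled queries in each direction, as in the cross-modal study.
q_small = small_pool[rng.choice(len(small_pool), 50, replace=False)]
q_large = large_pool[rng.choice(len(large_pool), 50, replace=False)]

small_to_large = mean_top_k(q_small, large_pool)
large_to_small = mean_top_k(q_large, small_pool)
print(f"small->large {small_to_large:.3f}, large->small {large_to_small:.3f}")
```

A fairer comparison across directions could subsample every target pool to the same size before taking the top-10.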

No baselines: Results are not compared to simpler methods (BM25, TF-IDF, keyword matching). High cosine similarity scores look impressive but may not represent improvement over basic text matching for content with explicit scripture references.
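A lexical baseline of the kind named here can be sketched in plain Python. This is a minimal TF-IDF cosine ranker using a smoothed IDF, ln((1 + N) / (1 + df)) + 1, chosen as one common variant; the tokenizer, the function names, and the toy documents are hypothetical, and a real comparison would more likely use an established BM25 or TF-IDF library.

```python
import math
import re
from collections import Counter


def tokenize(text: str) -> list[str]:
    """Lower-case alphanumeric tokens; chapter numbers such as '8' stay as tokens."""
    return re.findall(r"[a-z0-9]+", text.lower())


def _weigh(counts: Counter, idf: dict[str, float]) -> dict[str, float]:
    """Weight raw term counts by IDF and L2-normalise; terms without an IDF are dropped."""
    vec = {t: c * idf[t] for t, c in counts.items() if t in idf}
    norm = math.sqrt(sum(w * w for w in vec.values())) or 1.0
    return {t: w / norm for t, w in vec.items()}


def tfidf_vectors(docs: list[str]) -> tuple[list[dict[str, float]], dict[str, float]]:
    """Build normalised TF-IDF vectors for docs plus the IDF table."""
    tokenized = [tokenize(d) for d in docs]
    n = len(docs)
    df = Counter(term for toks in tokenized for term in set(toks))
    idf = {t: math.log((1 + n) / (1 + c)) + 1.0 for t, c in df.items()}
    return [_weigh(Counter(toks), idf) for toks in tokenized], idf


def rank(query: str, docs: list[str]) -> list[int]:
    """Indices of docs sorted by TF-IDF cosine similarity to the query."""
    vectors, idf = tfidf_vectors(docs)
    q = _weigh(Counter(tokenize(query)), idf)
    scores = [sum(w * d.get(t, 0.0) for t, w in q.items()) for d in vectors]
    return sorted(range(len(docs)), key=lambda i: scores[i], reverse=True)


if __name__ == "__main__":
    docs = [
        "Sermon on Romans 8: no condemnation in Christ",
        "Worship album: songs of grace",
        "Podcast: walking through Psalms 23",
    ]
    print(rank("Romans 8", docs))
```

Running the same queries through a ranker like this and through the embedding model would show how much of the measured retrieval comes from literal reference matching.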

Benchmark · Scripture Retrieval

Can a bare scripture reference find everything that cites it?

A precision-at-K benchmark over 4,133 query–item pairs, scored on GPU in under a second. Not a marketing metric — a floor.

What You're Seeing

Measuring retrieval the way a search engine would

For every item in the catalog with explicit scripture_refs, we take each reference ("Romans 8:28", "Psalms 23", etc.), encode it with the same BAAI/bge-small-en-v1.5 model used by site search, and rank all 14,729 catalog embeddings by cosine similarity. A retrieved item counts as relevant if it cites the same chapter-level reference as the query — giving us a clean ground truth for Precision@1, @5, and @10.

This is strict: no human judgement and no generous interpretation, only whether the top-K contains items that the catalog itself says are about the same passage. A 45% P@1 means that in nearly half of all 4,133 tests, the very top result out of a pool of 14,729 was another item citing that exact chapter.
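The scoring rule can be sketched as follows. Everything here is an assumption for illustration: reducing "Romans 8:28" to the chapter key "Romans 8" by cutting at the first colon (which presumes consistent book names), the data layout (a ranked list of item IDs and a mapping from item ID to its scripture_refs), and skipping the query's own source item, which the phrase "another item" suggests but the text does not state outright.

```python
def chapter_key(ref: str) -> str:
    """Reduce a reference to chapter level: 'Romans 8:28' -> 'Romans 8'.

    Assumption: references look like 'Book Chapter[:Verses]'.
    """
    return ref.split(":", 1)[0].strip()


def precision_at_k(ranked_ids: list[str],
                   query_ref: str,
                   source_id: str,
                   refs_by_item: dict[str, list[str]],
                   k: int) -> float:
    """Fraction of the top-k results that cite the query's chapter.

    The item the query reference came from is skipped (an assumption),
    so a hit always means another item citing the same chapter.
    """
    target = chapter_key(query_ref)
    top = [item for item in ranked_ids if item != source_id][:k]
    hits = sum(
        any(chapter_key(r) == target for r in refs_by_item.get(item, []))
        for item in top
    )
    return hits / k


if __name__ == "__main__":
    # Toy catalog: item id -> its scripture_refs (hypothetical data).
    refs_by_item = {
        "sermon-1": ["Romans 8:28"],
        "podcast-7": ["Romans 8:1-4", "Psalms 23"],
        "album-3": [],
        "podcast-9": ["Romans 12:2"],
    }
    # A ranking an encoder might return for the query "Romans 8:28".
    ranked = ["sermon-1", "podcast-7", "album-3", "podcast-9"]
    for k in (1, 3):
        print(k, precision_at_k(ranked, "Romans 8:28", "sermon-1", refs_by_item, k))
```

Averaging this value over every (item, reference) query gives the overall P@1, P@5, and P@10.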

Key Findings

Overall P@1 = 44.7%, P@5 = 26.2%, P@10 = 19.7% across 4,133 (item, reference) pairs drawn from 3,450 items — a respectable baseline for a general-purpose encoder that was never fine-tuned on scripture references
Pauline epistles lead the leaderboard: Philippians (P@1 75%, P@5 66%), Ephesians (P@1 74%, P@5 52%), Romans (P@1 70%, P@5 51%). Philippians and Ephesians are short letters with a tight thematic core, and even the much longer Romans clusters remarkably well in embedding space
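Popularity matters. For obscure passages (cited by only 2–3 items) P@5 collapses to 8.2%; for very popular ones (27+ items) it climbs to 39.3%, a 4.8× spread that quantifies the long-tail problem retrieval systems commonly face. Part of this spread is built into the metric: when only 2–3 items cite a chapter, at most 2–3 of the top 5 can be relevant (fewer if the query's own item is excluded), so even a perfect ranker could not exceed 40–60% P@5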
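Queries whose reference comes from a scripture passage item give the lowest P@1 of any source type (19.4%); references taken from sermons and podcasts score 2.3× higher. Since every query is the bare reference string, the source item's own text never enters the query, so this gap more likely reflects which chapters each source type tends to cite (compare the popularity effect above) than any difference in text; that explanation has not been tested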
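The popularity numbers are easier to read alongside the best score a perfect ranker could reach. By definition, P@K can be at most min(R, K)/K, where R is the number of relevant items available. A small sketch, assuming the query's own source item is excluded so a chapter cited by n items leaves R = n - 1; the function name is hypothetical.

```python
def max_precision_at_k(n_citing: int, k: int, exclude_source: bool = True) -> float:
    """Best P@K a perfect ranker could reach for a chapter cited by n_citing items.

    Assumes the query's own source item is skipped, leaving n_citing - 1
    relevant items; pass exclude_source=False if it stays in the ranking.
    """
    relevant = n_citing - 1 if exclude_source else n_citing
    return min(max(relevant, 0), k) / k


if __name__ == "__main__":
    for n in (2, 3, 27):
        print(n, [round(max_precision_at_k(n, k), 2) for k in (1, 5, 10)])
```

Under that assumption a chapter cited by 2 items caps P@5 at 20% and one cited by 3 items at 40%, so the 8.2% figure sits below a ceiling of 20–40%, while chapters cited by 27+ items face no ceiling at K = 5 or 10.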