Theological Embedding Space
14,729 pieces of Christian content mapped into a shared semantic space. UMAP projection reveals natural theological clusters.
A map of meaning, not keywords
Every sermon, podcast, song, and scripture passage in our catalog has been converted into a mathematical representation of its meaning, called an embedding, using the BAAI/bge-small-en-v1.5 sentence transformer. Think of it as translating each piece of content into a point in 384-dimensional space, where items with similar theology end up near one another.
The interactive map below projects all 14,729 items from 384 dimensions down to two dimensions using UMAP (Uniform Manifold Approximation and Projection), a technique that preserves local neighborhood structure. Content with similar theological meaning tends to appear close together; content with different meaning tends to land farther apart.
This is the foundation of how Rejoice searches: a sermon about Psalm 23 naturally clusters near a worship song about God as shepherd, even though they use completely different words.
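To make "near each other" concrete, here is a minimal sketch of ranking items by cosine similarity, a common way to compare embeddings. It is an illustration under stated assumptions, not Rejoice's actual search code: the titles, the 4-dimensional vectors, and the function names are all hypothetical stand-ins for real 384-dimensional embeddings.

```python
import math

def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

def rank_by_similarity(query_vec, catalog):
    """Return (title, format, score) tuples, most similar first."""
    scored = [
        (title, fmt, cosine_similarity(query_vec, vec))
        for title, fmt, vec in catalog
    ]
    return sorted(scored, key=lambda row: row[2], reverse=True)

# Hypothetical 4-dimensional stand-ins for real 384-dimensional embeddings.
CATALOG = [
    ("Worship song: The Lord Is My Shepherd", "music", [0.9, 0.1, 0.0, 0.2]),
    ("Podcast: Church history in the 4th century", "podcast", [0.0, 0.9, 0.3, 0.1]),
    ("Sermon: Stewardship and giving", "sermon", [0.1, 0.2, 0.9, 0.0]),
]

# Made-up vector standing in for the embedding of a sermon on Psalm 23.
query = [0.8, 0.2, 0.1, 0.3]

ranking = rank_by_similarity(query, CATALOG)
for title, fmt, score in ranking:
    print(f"{score:.3f}  {fmt:8s} {title}")
```

In this toy data the worship song ranks first for the sermon query because their vectors point in nearly the same direction, regardless of format or wording.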
- 6 distinct clusters emerged from unsupervised clustering — the model discovered theological groupings from content metadata alone, without any manual labeling
- Content types intermingle within clusters: sermons and podcasts on the same theological topic appear side-by-side, validating cross-format retrieval
- Music forms its own dedicated cluster (100% music), while the other 5 clusters are primarily spoken-word, reflecting a fundamental distinction between devotional and expository content
- The clusters map to recognizable theological traditions: Reformed/evangelical, charismatic/prophetic, liturgical/Orthodox, scholarly/historical, and practical Christian living
Cluster Analysis
Switch to the "By Cluster" view above to see these groupings. Each cluster was identified through unsupervised clustering on the 384-dimensional embeddings, then manually examined to understand what theological themes the model discovered.
Who ranges, who stays put?
Using the cluster assignments above, we measure each creator's theological footprint — the number of clusters their content touches.
Cluster span as a theological-diversity signal
For every creator (artist or speaker) with five or more items in the catalog, we look up which of the six clusters each item belongs to and count how many distinct clusters that creator's output spans. A single-cluster creator has a focused theological lane; a three- or four-cluster creator ranges across multiple traditions.
Because our clustering was unsupervised, this number isn't a judgement on breadth or depth — it's a shape. Some creators (Reformed teaching networks, liturgical podcasts, worship-only artists) stay firmly in one lane because that's their calling. Others (large multi-site churches, academic programs, denominations with varied output) span several because they're genuinely covering more ground.
- 977 creators have 5+ items in the catalog. Of those, 442 (45%) stay in one cluster, 421 (43%) span two, 109 (11%) span three, and only 5 creators touch four of the six clusters
- The most diverse creators in the catalog — Discerning Hearts Catholic Podcasts, Center for the Study of New Testament Manuscripts, Pastor Lance Ralston — span four clusters, reflecting wide-ranging editorial portfolios that cover Scripture, history, and practical teaching in one feed
- The most focused creators by volume — Tze-John Liu (308 items, all Liturgical & Global), Orthodox Christian Teaching (99% Liturgical), Thomas Babington Macaulay (66 items, all Historical & Scholarly) — illustrate how tradition-bound voices cluster cleanly
- Large multi-site evangelical churches and their pastors, such as Elevation, Gateway, Watermark, and Steven Furtick, all span exactly two clusters (typically Evangelical & Reformed plus Charismatic & Scripture), revealing a consistent bimodal pattern in the modern megachurch content portfolio
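The cluster-span measure described above can be sketched in a few lines. This is an illustrative reconstruction, assuming each catalog item is available as a (creator, cluster_id) pair; the function names and the toy data are hypothetical and do not come from the real catalog.

```python
from collections import Counter, defaultdict

MIN_ITEMS = 5  # threshold stated in the text: creators with five or more items

def cluster_spans(items, min_items=MIN_ITEMS):
    """Map each qualifying creator to the number of distinct clusters they touch.

    `items` is an iterable of (creator, cluster_id) pairs, one per catalog item.
    """
    item_counts = Counter()
    clusters_seen = defaultdict(set)
    for creator, cluster_id in items:
        item_counts[creator] += 1
        clusters_seen[creator].add(cluster_id)
    return {
        creator: len(clusters_seen[creator])
        for creator, n in item_counts.items()
        if n >= min_items
    }

def span_distribution(spans):
    """Count how many creators have each span value."""
    return dict(sorted(Counter(spans.values()).items()))

# Hypothetical toy data, not drawn from the real catalog.
toy_items = (
    [("Creator A", 3)] * 6                              # six items, one cluster
    + [("Creator B", 2)] * 3 + [("Creator B", 4)] * 3   # six items, two clusters
    + [("Creator C", 1)] * 2                            # below the five-item threshold
)

spans = cluster_spans(toy_items)
print(spans)
print(span_distribution(spans))
```

Applied to the full catalog, the distribution would be the source of the figures above: 442, 421, 109, and 5 creators sum to 977, and dividing each by 977 gives roughly 45%, 43%, 11%, and under 1%.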
What the Clusters Reveal
The clustering reveals a meaningful theological spectrum that the model discovered without being told what any of these traditions are:
The content-type boundary is real but not absolute. Music forms a distinct cluster because its metadata (album titles, genre tags) differs fundamentally from spoken-word content. But within spoken-word content, the model groups by theological tradition rather than format — a Reformed sermon clusters with Reformed podcasts, not with charismatic sermons.
The largest clusters (2 & 4) represent the two dominant poles of American Protestantism: mainstream evangelical teaching and charismatic/prophetic ministry. Together they account for 63% of all content.
The smallest clusters (3 & 5) — historical/scholarly and liturgical/Orthodox — are underrepresented in the catalog but clearly distinct in embedding space, suggesting the model could support these traditions well with more data.
Implications for Cross-Format Discovery
These findings validate Rejoice's approach: because content clusters by theological meaning rather than format, a user listening to a Reformed podcast can be recommended a Reformed sermon on the same topic. The embedding space "knows" they belong together even though they're in different formats. This wouldn't work if content clustered purely by format (all sermons together, all podcasts together).
Methodology & Limitations
Model: BAAI/bge-small-en-v1.5, a general-purpose sentence transformer (384 dimensions). This model was not fine-tuned on theological text, meaning the clusters emerge from general semantic understanding rather than domain-specific training. A theology-adapted model could yield different, potentially more nuanced groupings.
Clustering: Clusters were identified through unsupervised methods on the raw embeddings. No silhouette scores or formal cluster validation metrics are reported — the cluster labels were assigned through manual examination of member content. Future work should include quantitative cluster quality metrics.
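As a pointer toward the quantitative validation mentioned above, here is a minimal sketch of the mean silhouette coefficient, one standard cluster-quality metric. It was not run on the real embeddings. The toy points are hypothetical, and Euclidean distance is an assumption (cosine distance is also common for sentence embeddings).

```python
import math
from collections import defaultdict

def mean_silhouette(points, labels):
    """Mean silhouette coefficient over all points, using Euclidean distance.

    For point i: a = mean distance to the other members of its own cluster,
    b = smallest mean distance to the members of any other cluster,
    s = (b - a) / max(a, b). Points in singleton clusters score 0.
    """
    members = defaultdict(list)
    for idx, label in enumerate(labels):
        members[label].append(idx)
    if len(members) < 2:
        raise ValueError("silhouette needs at least two clusters")

    scores = []
    for i, label in enumerate(labels):
        own = [j for j in members[label] if j != i]
        if not own:
            scores.append(0.0)
            continue
        a = sum(math.dist(points[i], points[j]) for j in own) / len(own)
        b = min(
            sum(math.dist(points[i], points[j]) for j in idxs) / len(idxs)
            for other, idxs in members.items()
            if other != label
        )
        denom = max(a, b)
        scores.append((b - a) / denom if denom > 0 else 0.0)
    return sum(scores) / len(scores)

# Hypothetical toy points: two tight pairs that are far from each other.
toy_points = [(0.0, 0.0), (0.0, 1.0), (10.0, 0.0), (10.0, 1.0)]

good = mean_silhouette(toy_points, [0, 0, 1, 1])  # labels match the pairs
bad = mean_silhouette(toy_points, [0, 1, 0, 1])   # labels split the pairs
print(f"good labelling: {good:.2f}, bad labelling: {bad:.2f}")
```

Scores near 1 indicate compact, well-separated clusters, and scores near or below 0 indicate overlap. For a catalog of this size, a library routine such as scikit-learn's `silhouette_score` would be a more practical choice than this pure-Python loop.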
Projection: UMAP reduces 384 dimensions to 2 for visualization. This necessarily distorts distances — points that appear nearby in 2D may not be nearest neighbors in the full space. The visualization is best understood as a qualitative overview, not a precise distance map.
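One way to quantify this distortion is to check how many of each point's nearest neighbors survive the projection. The sketch below is a hypothetical illustration: the helper names are invented, and the "projection" simply drops a coordinate from toy 3D points rather than running UMAP.

```python
import math

def k_nearest(points, i, k):
    """Indices of the k points closest to points[i], excluding i itself."""
    dists = [
        (math.dist(points[i], p), j)
        for j, p in enumerate(points)
        if j != i
    ]
    return {j for _, j in sorted(dists)[:k]}

def neighbor_overlap(high_dim, low_dim, k):
    """Average fraction of each point's k nearest neighbors shared by both spaces."""
    total = 0.0
    for i in range(len(high_dim)):
        shared = k_nearest(high_dim, i, k) & k_nearest(low_dim, i, k)
        total += len(shared) / k
    return total / len(high_dim)

# Hypothetical toy data: the third point is far away along the z axis only.
high = [(0.0, 0.0, 0.0), (1.0, 0.0, 0.0), (0.2, 0.0, 5.0), (4.0, 0.0, 0.0)]

# Stand-in "projection": drop the z coordinate (this is NOT UMAP).
low = [(x, y) for x, y, _ in high]

overlap = neighbor_overlap(high, low, k=1)
print(f"share of nearest neighbors preserved: {overlap:.2f}")
```

Here the third point looks like a close neighbor of the first two in 2D even though it is distant in the original space, so only part of the neighbor structure is preserved. On the real data the same check would compare the 384-dimensional embeddings with their UMAP coordinates; scikit-learn's `trustworthiness` function offers a related, more refined measure.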
Data bias: The catalog is heavily weighted toward English-language evangelical content, and toward a single format: podcasts account for 79% of the catalog. Underrepresented traditions (Orthodox, Catholic, non-English) may cluster differently with more balanced data.