The Rejoice Research Dataset
14,729 pieces of Christian audio content — sermons, podcasts, worship music, and scripture passages — encoded into a shared semantic space for cross-format discovery research.
Why this research exists
Christian audio content is fragmented across platforms. Sermons live on YouTube, worship music on Spotify, podcasts on Apple Podcasts, and scripture in Bible apps. A listener studying Romans 8 has no way to discover that a worship song, a podcast episode, and a sermon all explore the same theology — unless someone manually curates that connection.
Rejoice Research asks: can machine learning bridge these silos? Can a single AI model understand theological meaning well enough to connect a sermon about grace to a worship song about the same theme, without being explicitly taught what "grace" means?
This dataset is the foundation for answering that question. Every item has been encoded into a 384-dimensional embedding using a sentence transformer, creating a shared mathematical space where theological similarity can be measured across formats. The six research projects on this site explore different facets of what that embedding space reveals.
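To make "similarity can be measured" concrete, here is a minimal sketch that scores two embeddings with cosine similarity. It assumes cosine similarity as the metric; the source does not say which metric the projects use. The short vectors are toy stand-ins for real 384-dimensional embeddings.

```python
import math
from typing import Sequence

def cosine_similarity(a: Sequence[float], b: Sequence[float]) -> float:
    """Cosine of the angle between two vectors (1.0 = same direction)."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)

# Toy 4-dimensional vectors; real embeddings would have 384 dimensions.
sermon = [0.9, 0.1, 0.0, 0.4]
song = [0.8, 0.2, 0.1, 0.5]
unrelated = [0.0, 0.9, 0.4, 0.0]

# The sermon/song pair points in nearly the same direction,
# so it scores higher than the sermon/unrelated pair.
print(cosine_similarity(sermon, song) > cosine_similarity(sermon, unrelated))
```

Because the score depends only on direction, a sermon and a song can be compared even though their descriptions differ in length and style.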
Dataset at a Glance
Content Composition
The dataset spans four content formats: sermons, podcasts, worship music, and scripture passages. Podcasts make up the majority (79% of items). Every item includes a title, creator, description, and content type. Scripture references and thematic tags are available for a subset.
Metadata Coverage
Not every item has the same level of metadata. Here's what's available:
| Field | Coverage | Items |
|---|---|---|
| Title | All items | 14,729 |
| Description | All items | 14,729 |
| Creator / Artist | 96% of items | 14,131 |
| Scripture References | 23.4% of items | 3,450 |
| Theme / Genre Tags | Primarily music | ~2,500 |
| 384-dim Embedding | All items | 14,729 |
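The coverage percentages follow directly from the item counts. As a quick check, this short sketch uses only the figures in the table (the function and variable names are illustrative):

```python
TOTAL_ITEMS = 14_729

# Item counts taken from the coverage table.
field_counts = {
    "Creator / Artist": 14_131,
    "Scripture References": 3_450,
}

def coverage_percent(count: int, total: int = TOTAL_ITEMS) -> float:
    """Share of items carrying a field, as a percentage rounded to one decimal."""
    return round(100 * count / total, 1)

for field, count in field_counts.items():
    print(f"{field}: {coverage_percent(count)}%")
```

This gives 95.9% for Creator / Artist and 23.4% for Scripture References, consistent with the rounded figures in the table.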
Top Content Sources
The dataset draws from nearly 2,000 unique creators. The largest contributors reflect the diversity of Christian traditions represented:
| Creator / Source | Tradition | Items |
|---|---|---|
| Tze-John Liu | Orthodox / Global | 308 |
| Loyal Books | Historical / Scholarly | 233 |
| Orthodox Christian Teaching | Eastern Orthodox | 214 |
| Various Artists | Christian Music | 144 |
| LibriVox | Historical / Audiobook | 99 |
| American Family Association | Evangelical | 84 |
| Discerning Hearts | Catholic | 66 |
| CSNTM | Manuscript Studies | 63 |
| AudioVerse | Adventist | 52 |
| 1611 KJV Prophecy Church | Prophetic / Eschatological | 47 |
How the Data Was Processed
Each item in the catalog was encoded into a 384-dimensional embedding vector using BAAI/bge-small-en-v1.5, a general-purpose sentence transformer. Only text metadata (titles, descriptions, tags) is passed to the model.
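A minimal sketch of that encoding step, assuming the `sentence-transformers` library and assuming each item's title, description, and tags are joined into one string before encoding. The `build_text` helper, the field names, and the joining scheme are hypothetical. The source confirms only the model name, the 384-dimensional output, and that text metadata is what gets encoded.

```python
from typing import Any, Dict, List

MODEL_NAME = "BAAI/bge-small-en-v1.5"  # model named in the text

def build_text(item: Dict[str, Any]) -> str:
    """Join an item's text metadata into one string (hypothetical scheme)."""
    parts = [item.get("title") or "", item.get("description") or ""]
    parts.extend(item.get("tags") or [])
    # Drop empty fields so missing metadata does not leave stray separators.
    return ". ".join(str(p).strip() for p in parts if str(p).strip())

def encode_items(items: List[Dict[str, Any]]):
    """Encode items into unit-length vectors, one 384-dim row per item."""
    # Imported lazily so build_text works without the library installed.
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_NAME)
    texts = [build_text(item) for item in items]
    # With normalized vectors, a dot product equals cosine similarity.
    return model.encode(texts, normalize_embeddings=True)
```

Calling `encode_items` on a list of item dictionaries would return an array with one row per item. Items with sparse metadata simply produce shorter input strings.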
Known Biases & Limitations
Format imbalance: Podcasts represent 79% of the dataset. This skews clustering, similarity averages, and search results toward podcast content. Music, sermons, and scripture are underrepresented relative to their real-world volume.
Language: The dataset is predominantly English-language, with some Chinese, Spanish, and Filipino content in the Liturgical/Global cluster. Non-English content may be poorly encoded because BAAI/bge-small-en-v1.5 is an English-language model.
Tradition bias: Contemporary evangelical and charismatic content dominates. Mainline Protestant, Catholic, and Orthodox traditions are represented but are minority voices in the dataset.
Metadata quality: Music metadata tends to be sparser (album titles and artist names) than sermon and podcast metadata (full episode descriptions). This likely contributes to music scoring lower on cross-modal similarity metrics.
No audio features: Only text metadata (titles, descriptions, tags) is encoded. Actual audio content is not analyzed — a sermon's tone, a song's melody, or a speaker's style are invisible to the model.
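The effect of format imbalance on similarity averages can be illustrated with a small sketch. The scores below are invented toy values, not results from this dataset. The format-balanced (macro) average is one common point of comparison, not a method the source describes.

```python
from collections import defaultdict
from statistics import mean

# (content_type, similarity_score) pairs; toy values for illustration only.
scores = [("podcast", 0.80)] * 8 + [("music", 0.40), ("sermon", 0.60)]

def pooled_mean(rows):
    """Average over all items; dominated by the largest format."""
    return mean(score for _, score in rows)

def macro_mean(rows):
    """Average the per-format means so each format counts equally."""
    by_format = defaultdict(list)
    for content_type, score in rows:
        by_format[content_type].append(score)
    return mean(mean(values) for values in by_format.values())

print(round(pooled_mean(scores), 2), round(macro_mean(scores), 2))
```

With these toy values the pooled average is 0.74 while the format-balanced average is 0.60, showing how a majority format can pull a dataset-wide figure toward its own score.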
Research Projects Using This Data
Every project on this site uses the same underlying dataset and embeddings. Here's how each one builds on the data: