The Rejoice Research Dataset

14,729 pieces of Christian audio content — sermons, podcasts, worship music, and scripture passages — encoded into a shared semantic space for cross-format discovery research.

Why this research exists

Christian audio content is fragmented across platforms. Sermons live on YouTube, worship music on Spotify, podcasts on Apple Podcasts, and scripture in Bible apps. A listener studying Romans 8 has no way to discover that a worship song, a podcast episode, and a sermon all explore the same theology — unless someone manually curates that connection.

Rejoice Research asks: can machine learning bridge these silos? Can a single AI model understand theological meaning well enough to connect a sermon about grace to a worship song about the same theme, without being explicitly taught what "grace" means?

This dataset is the foundation for answering that question. Every item has been encoded into a 384-dimensional embedding using a sentence transformer, creating a shared mathematical space where theological similarity can be measured across formats. The six research projects on this site explore different facets of what that embedding space reveals.

Dataset at a Glance

Total Items: 14,729
Unique Creators: 1,982
Scripture Passages: 2,777
Unique Themes: 924

Content Composition

The dataset spans four content formats, with podcasts making up the majority. Every item includes a title, creator, description, and content type. Scripture references and thematic tags are available for a subset.

Podcasts: 11,697 items (79.4% of dataset)
Christian podcast episodes spanning practical theology, biblical teaching, devotionals, church history, and topical discussions. Sources range from major ministries to independent creators across evangelical, Reformed, charismatic, Orthodox, and Catholic traditions.

Music: 1,526 items (10.4% of dataset)
Christian music albums and recordings across genres: contemporary Christian, gospel, hip hop, rock, praise & worship, Southern gospel, metal, and hymns. Metadata includes album title, artist, genre tags, and descriptions.

Sermons: 993 items (6.7% of dataset)
Sermons from churches and ministries including Elevation Church, Saddleback, Gateway, Newspring, and hundreds of smaller congregations. Each includes title, speaker/church, description, and often scripture references.

Scripture Passages: 513 items (3.5% of dataset)
Bible chapter-level passages spanning both Old and New Testaments. These serve as anchor points in the embedding space, enabling content to be connected through shared biblical references.
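
As a quick consistency check, the percentages above can be recomputed from the item counts. This is a minimal sketch that uses only the four counts stated in this section; the variable names are illustrative.

```python
# Item counts per format, taken from the composition figures above.
counts = {
    "Podcasts": 11_697,
    "Music": 1_526,
    "Sermons": 993,
    "Scripture Passages": 513,
}

# The four formats should add up to the stated 14,729 total items.
total = sum(counts.values())

# Share of the dataset per format, as a percentage rounded to one decimal.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name:<20}{counts[name]:>7,}  {pct:>5.1f}%")
print(f"{'Total':<20}{total:>7,}")
```

The four counts sum to the 14,729 total, and the rounded shares match the figures listed for each format.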

Metadata Coverage

Not every item has the same level of metadata. Here's what's available:

Field                  Coverage          Items
Title                  All items         14,729
Description            All items         14,729
Creator / Artist       96% of items      14,131
Scripture References   23.4% of items    3,450
Theme / Genre Tags     Primarily music   ~2,500
384-dim Embedding      All items         14,729

Top Content Sources

The dataset draws from nearly 2,000 unique creators. The largest contributors reflect the diversity of Christian traditions represented:

Creator / Source              Tradition                    Items
Tze-John Liu                  Orthodox / Global            308
Loyal Books                   Historical / Scholarly       233
Orthodox Christian Teaching   Eastern Orthodox             214
Various Artists               Christian Music              144
LibriVox                      Historical / Audiobook       99
American Family Association   Evangelical                  84
Discerning Hearts             Catholic                     66
CSNTM                         Manuscript Studies           63
AudioVerse                    Adventist                    52
1611 KJV Prophecy Church      Prophetic / Eschatological   47

How the Data Was Processed

Each item in the catalog was encoded into a 384-dimensional embedding vector using BAAI/bge-small-en-v1.5, a general-purpose sentence transformer. The encoding process:

1. Text extraction: Title, description, artist, and available metadata are concatenated into a single text string per item.
2. Sentence encoding: Each text string is encoded via bge-small into a 384-dimensional dense vector.
3. L2 normalization: All vectors are normalized to unit length, enabling cosine similarity via simple dot product.
4. Shared space: All 14,729 items from all 4 formats live in the same vector space, enabling cross-modal similarity search.
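
The steps above might be sketched roughly as follows. This is not the project's actual pipeline code: the field names (`title`, `creator`, `description`, `tags`), the `". "` separator, and the helper names are assumptions made for illustration. Only the model name and the 384-dimensional, unit-length output come from the description above. The model call relies on the `sentence-transformers` package and is kept inside `encode_items` so the pure-Python helpers can be run without downloading the model.

```python
import math

MODEL_NAME = "BAAI/bge-small-en-v1.5"  # model named in the text above


def build_text(item: dict) -> str:
    """Step 1: concatenate available metadata into one string.

    Field names and the ". " separator are assumptions for illustration.
    """
    fields = ("title", "creator", "description", "tags")
    parts = []
    for key in fields:
        value = item.get(key)
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)
        if value:  # skip missing or empty fields
            parts.append(str(value).strip())
    return ". ".join(parts)


def l2_normalize(vec):
    """Step 3: scale a vector to unit length (zero vectors are returned as-is)."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm > 0 else list(vec)


def dot(a, b):
    """Dot product; for unit-length vectors this equals cosine similarity."""
    return sum(x * y for x, y in zip(a, b))


def encode_items(items):
    """Steps 2 and 3: encode items into unit-length embedding vectors.

    Needs the sentence-transformers package and downloads the model on
    first use, so the import is deferred until the function is called.
    """
    from sentence_transformers import SentenceTransformer

    model = SentenceTransformer(MODEL_NAME)
    texts = [build_text(item) for item in items]
    # normalize_embeddings=True applies L2 normalization in the same call.
    return model.encode(texts, normalize_embeddings=True)


if __name__ == "__main__":
    # Toy 2-dimensional vectors standing in for real 384-dim embeddings.
    a = l2_normalize([3.0, 4.0])
    b = l2_normalize([4.0, 3.0])
    print(dot(a, b))  # cosine similarity of the two toy vectors
```

Because every vector is normalized once at encoding time, comparing a sermon with a song later only requires a dot product between their stored vectors, regardless of format.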

Known Biases & Limitations

Format imbalance: Podcasts represent 79% of the dataset. This skews clustering, similarity averages, and search results toward podcast content. Music, sermons, and scripture are underrepresented relative to their real-world volume.

Language: The dataset is predominantly English-language, with some Chinese, Spanish, and Filipino content in the Liturgical/Global cluster. Non-English content may be poorly encoded by the English-trained model.

Tradition bias: Contemporary evangelical and charismatic content dominates. Mainline Protestant, Catholic, and Orthodox traditions are represented but are minority voices in the dataset.

Metadata quality: Music metadata tends to be sparser (album titles and artist names) than sermon and podcast metadata (full episode descriptions). This likely contributes to music scoring lower on cross-modal similarity metrics.

No audio features: Only text metadata (titles, descriptions, tags) is encoded. Actual audio content is not analyzed — a sermon's tone, a song's melody, or a speaker's style are invisible to the model.
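
To make the format imbalance noted above more concrete, the sketch below contrasts an item-weighted average with a format-balanced average. The item counts come from this page, but the per-format similarity scores are hypothetical placeholders chosen only for illustration. They are not measured results from this dataset.

```python
# Item counts per format, as stated in the dataset composition.
counts = {"Podcasts": 11_697, "Music": 1_526, "Sermons": 993, "Scripture": 513}

# HYPOTHETICAL per-format mean similarity scores, for illustration only.
mean_similarity = {
    "Podcasts": 0.70,
    "Music": 0.50,
    "Sermons": 0.65,
    "Scripture": 0.60,
}

total = sum(counts.values())

# Item-weighted average: every item counts once, so podcasts dominate.
item_weighted = sum(counts[f] * mean_similarity[f] for f in counts) / total

# Format-balanced average: every format counts once, whatever its size.
format_balanced = sum(mean_similarity.values()) / len(mean_similarity)

# Fraction of the item-weighted average that is determined by podcasts.
podcast_weight = counts["Podcasts"] / total

print(f"item-weighted:   {item_weighted:.4f}")
print(f"format-balanced: {format_balanced:.4f}")
print(f"podcast weight:  {podcast_weight:.3f}")
```

With these placeholder scores the item-weighted figure sits close to the podcast value, which is the kind of skew the format-imbalance note describes. Reporting per-format or format-balanced numbers alongside dataset-wide averages is one way to keep that skew visible.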

Research Projects Using This Data

Every project on this site uses the same underlying dataset and embeddings. Here's how each one builds on the data: