About the Data

The Rejoice Research Dataset

14,729 pieces of Christian audio content — sermons, podcasts, worship music, and scripture passages — encoded into a shared semantic space for cross-format discovery research.

Why this research exists

Christian audio content is fragmented across platforms. Sermons live on YouTube, worship music on Spotify, podcasts on Apple Podcasts, and scripture in Bible apps. A listener studying Romans 8 has no way to discover that a worship song, a podcast episode, and a sermon all explore the same theology — unless someone manually curates that connection.

Rejoice Research asks: can machine learning bridge these silos? Can a single AI model understand theological meaning well enough to connect a sermon about grace to a worship song about the same theme, without being explicitly taught what "grace" means?

This dataset is the foundation for answering that question. Every item has been encoded into a 384-dimensional embedding using a sentence transformer, creating a shared mathematical space where theological similarity can be measured across formats. The six research projects on this site explore different facets of what that embedding space reveals.

Dataset at a Glance

Metric	Count
Total Items	14,729
Unique Creators	1,982
Scripture Passages	2,777
Unique Themes	924

Content Composition

The dataset spans four content formats, with podcasts making up the majority. Every item includes a title, creator, description, and content type. Scripture references and thematic tags are available for a subset.

Podcasts: 11,697 items (79.4% of dataset)

Christian podcast episodes spanning practical theology, biblical teaching, devotionals, church history, and topical discussions. Sources range from major ministries to independent creators across evangelical, Reformed, charismatic, Orthodox, and Catholic traditions.

Music: 1,526 items (10.4% of dataset)

Christian music albums and recordings across genres: contemporary Christian, gospel, hip hop, rock, praise & worship, Southern gospel, metal, and hymns. Metadata includes album title, artist, genre tags, and descriptions.

Sermons: 993 items (6.7% of dataset)

Sermons from churches and ministries including Elevation Church, Saddleback, Gateway, NewSpring, and hundreds of smaller congregations. Each includes title, speaker/church, description, and often scripture references.

Scripture Passages: 513 items (3.5% of dataset)

Bible chapter-level passages spanning both Old and New Testaments. These serve as anchor points in the embedding space, enabling content to be connected through shared biblical references.
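
The format shares quoted above can be reproduced from the item counts alone. The short check below uses only the four counts stated in this section; the variable names are illustrative.

```python
# Item counts per format, as stated in this section.
counts = {
    "Podcasts": 11_697,
    "Music": 1_526,
    "Sermons": 993,
    "Scripture Passages": 513,
}

# The four formats should add up to the full dataset.
total = sum(counts.values())

# Share of the dataset per format, rounded to one decimal place.
shares = {name: round(100 * n / total, 1) for name, n in counts.items()}

for name, pct in shares.items():
    print(f"{name}: {counts[name]:,} items ({pct}% of dataset)")
```

The counts sum to the 14,729 total, and the rounded shares match the percentages listed for each format.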

Metadata Coverage

Not every item has the same level of metadata. Here's what's available:

Field	Coverage	Items
Title	All items	14,729
Description	All items	14,729
Creator / Artist	96% of items	14,131
Scripture References	23.4% of items	3,450
Theme / Genre Tags	Primarily music	~2,500
384-dim Embedding	All items	14,729
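
As a rough illustration of how coverage figures like these could be computed, the sketch below counts the items in which a field is present and non-empty. The record layout and field names (`title`, `creator`, `scripture_refs`) are assumptions made for the example, not the dataset's actual schema.

```python
def coverage(items, field):
    """Return (count, percent) of items whose `field` is present and non-empty."""
    present = sum(1 for item in items if item.get(field))
    pct = 100 * present / len(items) if items else 0.0
    return present, round(pct, 1)

# Tiny hypothetical catalog; records and field names are illustrative only.
sample = [
    {"title": "Episode 1", "creator": "A", "scripture_refs": ["Romans 8"]},
    {"title": "Album X", "creator": "B", "scripture_refs": []},
    {"title": "Sermon Y", "creator": None},
    {"title": "Psalm 23", "creator": "C", "scripture_refs": ["Psalm 23"]},
]

for field in ("title", "creator", "scripture_refs"):
    count, pct = coverage(sample, field)
    print(f"{field}: {count} of {len(sample)} items ({pct}%)")
```

An empty list or a missing value counts as "not covered" here, which is one reasonable convention among several.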

Creator / Source	Tradition	Items
Tze-John Liu	Orthodox / Global	308
Loyal Books	Historical / Scholarly	233
Orthodox Christian Teaching	Eastern Orthodox	214
Various Artists	Christian Music	144
LibriVox	Historical / Audiobook	99
American Family Association	Evangelical	84
Discerning Hearts	Catholic	66
CSNTM	Manuscript Studies	63
AudioVerse	Adventist	52
1611 KJV Prophecy Church	Prophetic / Eschatological	47

How the Data Was Processed

Each item in the catalog was encoded into a 384-dimensional embedding vector using BAAI/bge-small-en-v1.5, a general-purpose sentence transformer. The encoding process:

Text extraction: Title, description, artist, and available metadata concatenated into a single text string per item

Sentence encoding: Each text string encoded via bge-small into a 384-dimensional dense vector

L2 normalization: All vectors normalized to unit length, enabling cosine similarity via simple dot product

Shared space: All 14,729 items from all 4 formats live in the same vector space, enabling cross-modal similarity search
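
The steps above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the project's actual pipeline: the field names, the separator used when concatenating, and the helper names are all hypothetical. The model call in `encode_items` uses the sentence-transformers package and is defined but not run here; the toy vectors at the end only demonstrate why a dot product of unit-length vectors equals cosine similarity.

```python
import math

def build_text(item):
    """Concatenate the available metadata fields into one string per item."""
    fields = ("title", "description", "creator", "tags")  # hypothetical field names
    parts = []
    for key in fields:
        value = item.get(key)
        if isinstance(value, (list, tuple)):
            value = ", ".join(value)
        if value:
            parts.append(str(value))
    return ". ".join(parts)

def l2_normalize(vec):
    """Scale a vector to unit length; a zero vector is returned unchanged."""
    norm = math.sqrt(sum(x * x for x in vec))
    return [x / norm for x in vec] if norm else list(vec)

def dot(a, b):
    """Dot product; equals cosine similarity when both inputs are unit length."""
    return sum(x * y for x, y in zip(a, b))

def encode_items(items):
    """Encode items with bge-small (requires the sentence-transformers package)."""
    from sentence_transformers import SentenceTransformer  # imported lazily
    model = SentenceTransformer("BAAI/bge-small-en-v1.5")
    texts = [build_text(item) for item in items]
    # normalize_embeddings=True returns unit-length 384-dimensional vectors.
    return model.encode(texts, normalize_embeddings=True)

# Toy 3-dimensional vectors stand in for real 384-dimensional embeddings.
a = l2_normalize([3.0, 4.0, 0.0])
b = l2_normalize([4.0, 3.0, 0.0])
cosine = dot(a, b)

example_text = build_text({
    "title": "Grace Abounds",
    "description": "A sermon on Romans 8",
    "tags": ["grace", "assurance"],
})
```

Because every stored vector is already unit length, similarity search over the whole catalog reduces to dot products, with no per-query division by vector norms.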

Known Biases & Limitations

Format imbalance: Podcasts represent 79% of the dataset. This skews clustering, similarity averages, and search results toward podcast content. Music, sermons, and scripture are underrepresented relative to their real-world volume.

Language: The dataset is predominantly English-language, with some Chinese, Spanish, and Filipino content in the Liturgical/Global cluster. Non-English content may be poorly encoded by the English-trained model.

Tradition bias: Contemporary evangelical and charismatic content dominates. Mainline Protestant, Catholic, and Orthodox traditions are represented but are minority voices in the dataset.

Metadata quality: Music metadata tends to be sparser (album titles + artist names) compared to sermons and podcasts (full episode descriptions). This likely contributes to music scoring lower on cross-modal similarity metrics.

No audio features: Only text metadata (titles, descriptions, tags) is encoded. Actual audio content is not analyzed — a sermon's tone, a song's melody, or a speaker's style are invisible to the model.
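
The format imbalance noted above is easy to see on a toy example. The sketch below compares a plain global top-k ranking with a per-format top-k, one possible way to keep minority formats visible in results. It is an illustration only; the source does not say whether any project on this site ranks results this way. The items, titles, and 2-dimensional vectors are invented for the example.

```python
from collections import defaultdict

def score(query, vector):
    """Dot product; equals cosine similarity when both vectors are unit length."""
    return sum(q * v for q, v in zip(query, vector))

def top_k_global(query, items, k):
    """Titles of the k highest-scoring items, regardless of format."""
    ranked = sorted(items, key=lambda it: score(query, it["vector"]), reverse=True)
    return [it["title"] for it in ranked[:k]]

def top_k_per_format(query, items, k):
    """Titles of the k highest-scoring items within each format."""
    by_format = defaultdict(list)
    for it in items:
        by_format[it["format"]].append(it)
    return {fmt: top_k_global(query, group, k) for fmt, group in by_format.items()}

# Hypothetical unit-length vectors; podcasts outnumber the other formats.
items = [
    {"title": "Podcast A", "format": "podcast", "vector": [1.0, 0.0]},
    {"title": "Podcast B", "format": "podcast", "vector": [0.8, 0.6]},
    {"title": "Podcast C", "format": "podcast", "vector": [0.6, 0.8]},
    {"title": "Song D", "format": "music", "vector": [0.0, 1.0]},
    {"title": "Sermon E", "format": "sermon", "vector": [0.6, 0.8]},
]
query = [1.0, 0.0]

global_top = top_k_global(query, items, k=2)
per_format_top = top_k_per_format(query, items, k=1)
```

In this toy catalog the global top two are both podcasts, while the per-format ranking returns one result from each format. The trade-off is that per-format results may include items with low absolute similarity.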

Research Projects Using This Data

Every project on this site uses the same underlying dataset and embeddings. Here's how each one builds on the data: