Machine Learning | NLP | Data Mining
In 2020, for a project at the University of Waterloo, I explored whether song lyrics alone could reveal hidden characteristics of music, specifically a track’s genre and decade of release.
- Collected lyrics for 18,000+ Billboard Top 100 songs via the Genius API and cleaned them with tokenization, stopword removal, lemmatization, and stemming.
- Engineered features with TF-IDF on a vocabulary of ~20,000 unique words.
- Trained logistic regression models to classify both genre and decade.
- Experimented with unsupervised clustering to detect patterns in lyrical diversity across time.
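The TF-IDF and logistic regression steps above can be sketched roughly as follows. This is a minimal illustration using scikit-learn, not the project's actual code: the lyrics and labels are invented placeholders, the built-in English stopword list stands in for the fuller preprocessing described above, and the hyperparameters are assumptions.

```python
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.linear_model import LogisticRegression
from sklearn.pipeline import make_pipeline

# Invented placeholder lyrics and labels, NOT taken from the project's dataset.
train_lyrics = [
    "truck beer cowboy dirt road",
    "cowboy boots beer whiskey truck",
    "money flow street hustle rhyme",
    "rhyme beat street money mic",
]
train_genres = ["country", "country", "rap", "rap"]

# TF-IDF turns each lyric into a weighted bag-of-words vector;
# logistic regression then learns one weight per vocabulary term.
model = make_pipeline(
    TfidfVectorizer(stop_words="english"),  # assumed stand-in for custom cleaning
    LogisticRegression(max_iter=1000),
)
model.fit(train_lyrics, train_genres)

# Classify two unseen (also invented) lyric snippets.
predictions = model.predict([
    "beer and a truck on a dirt road",
    "money and rhyme on the street",
])
print(list(predictions))
```

The same pipeline shape would apply to decade classification by swapping the genre labels for decade labels.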
Key Outcomes
- Lyrics predicted decade more accurately (84%) than genre (76%).
- Distinctive vocabulary gave rap songs the highest accuracy (91%), while country terms like "truck," "beer," and "cowboy" proved strong predictors.
- Clustering highlighted broad trends (e.g., rap’s word density post-1990s) but lacked precision due to genre diversity.
Impact
This project demonstrates how NLP and supervised learning can extract cultural and stylistic signals from raw text data. The insights could power music recommendation engines, cultural analysis, or artist strategy, showing how data science can make the abstract idea of “music taste” more tangible.
If you'd like to see the full report, drop me a line.