Slides for the project are available on Google Slides and as a PDF.
This data science project tests whether the lyrical content of a song can predict whether it will appear on the Billboard Year-End Hot 100 singles chart. It combines several datasets into a final dataframe of songs that charted and songs that did not, with each group making up almost 50% of the set. Each row holds the bag-of-words version of the song's lyrics, the analyses run on them (sentiment analysis, frequency of obscene words, frequency of words pertaining to certain themes, total number of unique words, etc.), and the year the song charted. The last column, 'charted', is a binary variable that records the chart status of the song.
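As a rough illustration of the labeling step described above, the sketch below merges a list of charted songs with a list of non-charted songs and attaches the binary `charted` column. The function name `build_labeled_set`, the dict-per-song layout, and the exact 50/50 downsampling are assumptions made for this example; the project's own scripts may balance the classes differently.

```python
import random


def build_labeled_set(charted, not_charted, seed=0):
    """Combine two song lists into one balanced, labeled list of rows.

    Hypothetical helper: each song is assumed to be a dict of features.
    """
    rng = random.Random(seed)
    # Downsample the larger class so both classes have the same size.
    n = min(len(charted), len(not_charted))
    rows = [dict(song, charted=1) for song in rng.sample(charted, n)]
    rows += [dict(song, charted=0) for song in rng.sample(not_charted, n)]
    # Shuffle so the label is not implied by row order.
    rng.shuffle(rows)
    return rows


hits = [{"title": "A", "year": 1999}, {"title": "B", "year": 2004}]
misses = [
    {"title": "C", "year": 1999},
    {"title": "D", "year": 2001},
    {"title": "E", "year": 2010},
]
rows = build_labeled_set(hits, misses)
```

Keeping `charted` as the last key of each row mirrors the column order described above.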
- Track information
  - Year (int)
  - Decade (int)
- Lyrical content
  - Unique Words, w/o stopwords (int)
  - Density, w/o stopwords (int)
  - Unique Words, w/ stopwords (int)
  - Density, w/ stopwords (int)
  - Nouns (int)
  - Verbs (int)
  - Adjectives (int)
  - Syllables (int)
  - Most used term (string)
  - Most used frequency (int)
  - Curses (binary)
  - Total curses (int)
  - Reading score (float)
  - Sentiment (float)
- Chart
  - Charted (binary)
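A few of the lyrical columns listed above can be derived from a simple bag of words. The following is a minimal sketch using only the Python standard library. The `STOPWORDS` and `CURSES` sets are tiny hypothetical placeholders, the tokenizer is a plain regular expression, and counting the most used term over non-stopword tokens is an assumption. Density, part-of-speech counts, syllables, reading score, and sentiment are left out because the column list does not say how they are computed.

```python
import re
from collections import Counter

# Hypothetical placeholder word lists; a real run would use fuller lists.
STOPWORDS = {"the", "a", "and", "i", "you", "to", "of", "in", "it", "my"}
CURSES = {"damn", "hell"}


def tokenize(lyrics):
    """Lowercase the lyrics and split them into word tokens."""
    return re.findall(r"[a-z']+", lyrics.lower())


def lyric_features(lyrics):
    """Compute a subset of the lyrical columns for one song."""
    tokens = tokenize(lyrics)
    content = [t for t in tokens if t not in STOPWORDS]
    counts = Counter(content)  # bag of words without stopwords
    top_term, top_freq = counts.most_common(1)[0] if counts else ("", 0)
    total_curses = sum(1 for t in tokens if t in CURSES)
    return {
        "unique_words_no_stopwords": len(set(content)),
        "unique_words_with_stopwords": len(set(tokens)),
        "most_used_term": top_term,
        "most_used_frequency": top_freq,
        "curses": int(total_curses > 0),
        "total_curses": total_curses,
    }


features = lyric_features("Love love me do, you know I love you")
```

Returning a plain dict per song keeps the sketch independent of any dataframe library; a list of such dicts can be turned into a dataframe in a later step.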
src
├── data; the datasets for the project
├── code; scripts to build the datasets
└── assets; static files and docs
23 directories, 60 files