Is Beyoncé better than Queen?

Step by step data analysis. Also: What is sentiment analysis? How to create a wordcloud? What is wrong with Beyoncé?

Hey guys! I was recently scrolling through some memes and came across this:

You have surely seen this before, haven't you? I like Beyoncé. Like, a lot. But I'm also aware that popular music nowadays is not so sophisticated. Or is it? What a great question, and one we can answer right now with a few analytical skills! Let's check whether Freddie Mercury's lyrics are FAR better than Beyoncé's (which is what my boyfriend keeps pointing out to me, so yes, I'm doing this analysis only to throw it in his face saying "AHHHAAAAAA!". Back to the post now.)

What should we do:

collect and explore the data
clean the data
analyze the data
present the results

What will we need?

Lyrics of Freddie Mercury’s songs
Lyrics of Beyoncé’s songs
RStudio and some special libraries

Part 1 - Cleaning the data.

Let’s go!

For lyrics, you can go to this Kaggle site and download the 99 MB dataset that contains Beyoncé's songs, and then to this Kaggle site, which contains a dataset with Queen's songs.

Then, we have to launch RStudio, load the data, inspect it, clean it, and prepare the subsets that will be analysed.

Load the data:

setwd("C:/computor/directory_with_your_downloaded_data/")
#in R 4.0 and later, add stringsAsFactors=TRUE to get the factor columns shown below
songdata <- read.csv(file="songdata.csv")
lyrics <- read.csv(file="lyrics.csv")

Inspect it:

str(songdata)
'data.frame':  57650 obs. of  4 variables:
 $ artist: Factor w/ 643 levels "'n Sync","ABBA",..: 2 2 2 2 2 2 2 2 2 2 ...
 $ song  : Factor w/ 44824 levels "'39","'59 Crunch",..: 1364 2345 2896 3677 3678 6022 6771 7219 8410 8598 ...
 $ link  : Factor w/ 57650 levels "/a/abba/ahesmykindofgirl_20598417.html",..: 1 2 3 4 5 6 7 8 9 10 ...
 $ text  : Factor w/ 57494 levels "'AD 'AKHSHYAV LO NISH'AR DAVAR  \nTEN LI SIMAN  \nHAKE'EV SHEHAYAH NISH'AR  \nKEN KOL HAZMAN  \nATAH MITRACHEK, K'MU GAL NE'ELA"| __truncated__,..: 31695 43057 17616 32559 32560 51085 11108 8594 25646 18976 ...

 str(lyrics)
'data.frame':  339277 obs. of  6 variables:
 $ index : int  0 1 2 3 4 5 6 7 8 9 ...
 $ song  : Factor w/ 236867 levels "0-0","0-0-0",..: 56272 205225 85347 233083 23435 8810 146916 219754 178029 228598 ...
 $ year  : int  2009 2009 2009 2009 2009 2009 2009 2009 2009 2009 ...
 $ artist: Factor w/ 17088 levels "009-sound-system",..: 4499 4499 4499 4499 4499 4499 4499 4499 4499 4499 ...
 $ genre : Factor w/ 12 levels "Country","Electronic",..: 10 10 10 10 10 10 10 10 10 10 ...
 $ lyrics: Factor w/ 229599 levels "","\003Its what youre afraid of.\nAll of my fears,\nAll of my Faults.\nAll that came first,\nAll will be lost.",..: 141437 150816 108358 142333 149557 92449 187210 196953 16818 136101 ...

head(songdata)
  artist                  song                                       link
1   ABBA Ahe's My Kind Of Girl /a/abba/ahesmykindofgirl_20598417.html
2   ABBA      Andante, Andante      /a/abba/andanteandante_20002708.html

 ... <truncated>
1 Look at her face, it's a wonderful face  \nAnd ... <truncated>
2 Take it easy with me, please  ... <truncated>
(...)

head(lyrics)
  index                   song year          artist genre
1     0              ego-remix 2009 beyonce-knowles   Pop
2     1           then-tell-me 2009 beyonce-knowles   Pop
                                                                                              ... <truncated>
1 Oh baby, how you doing? ... <truncated>
2  ... <truncated>
(...)

colnames(songdata)
[1] "artist" "song"   "link"   "text"  
colnames(lyrics)
[1] "index"  "song"   "year"   "artist" "genre"  "lyrics"

From the above we know that the first dataset has 4 columns, "artist", "song", "link" and "text", all of them factors. The second dataset has 6 columns, "index", "song", "year", "artist", "genre" and "lyrics", which are also factors, apart from index and year, which are integers. We won't need the index and link columns, so we will delete them from the subsets. We would also like to change the "artist" columns in the subsets from factor to string; you will see why.

Look for Beyoncé and Queen and extract them:

Both datasets contain the "artist" column, but we don't know how the names are stored: capital letters? Hyphens, spaces? The E in BEYONCÉ has an accent; is it kept? Instead of looking for a value equal to "Beyonce" or "Queen", we will grep the column for a matching string. Then we will copy the rows we find into new datasets and clean them.

nrow(songdata[grep("queen", songdata$artist, ignore.case = TRUE),])
[1] 413
nrow(lyrics[grep("queen", lyrics$artist, ignore.case = TRUE),])
[1] 41
nrow(lyrics[grep("beyonc", lyrics$artist, ignore.case = TRUE),])
[1] 371
nrow(songdata[grep("beyonc", songdata$artist, ignore.case = TRUE),])
[1] 0
queen <- songdata[grep("queen", songdata$artist, ignore.case = TRUE),]
beyonce <- lyrics[grep("beyonc", lyrics$artist, ignore.case = TRUE),]
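
If the substring matching feels abstract, here is a tiny sketch on a made-up vector of artist names (the `artists` vector is invented for illustration, it is not taken from the datasets). It shows why a plain grep for "queen" catches more than we want, and how an anchored pattern would match the exact name only:

```r
# Made-up artist names, just for illustration
artists <- c("Queen", "Queen Latifah", "Queensryche", "ABBA")

# Substring match, case-insensitive: catches every name containing "queen"
loose <- grep("queen", artists, ignore.case = TRUE, value = TRUE)
print(loose)

# Anchored match: ^ and $ force the whole string to be "queen"
exact <- grep("^queen$", artists, ignore.case = TRUE, value = TRUE)
print(exact)
```

In this post I stick with the loose match and filter afterwards, because it lets us see how the names are actually spelled before committing to one.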

Clean data subsets:

Let’s check which artists we found using grep:

table(queen$artist)

                                     'n Sync                                         ABBA 
                                           0                                            0 
                                 Ace Of Base                                 Adam Sandler 
                                           0                                            0 
                                       Adele                                    Aerosmith 
                                           0                                            0 
                                  Air Supply                                Aiza Seguerra 
                                           0                                            0 
                                     Alabama                         Alan Parsons Project 
                                           0                                            0 
(...)
                                       

What the hell? This is not what we wanted! Why does it look like that? Remember when we checked the column types? Artists are stored as a factor with levels, which means we get the whole list of levels, even those whose count is 0 (if you wanna read about factors, you can visit this page). As I mentioned before, we wanna change the factor to a string.

queen <- data.frame(lapply(queen, as.character), stringsAsFactors=FALSE)
table(queen$artist)

                  Queen           Queen Adreena           Queen Latifah Queens Of The Stone Age             Queensryche 
                    163                      41                      50                      68                      91 

See? Much better. Now we clearly see we have to get rid of Queen Adreena, Queen Latifah, Queens Of The Stone Age and Queensryche rows.

queen <- queen[queen$artist=="Queen",]
unique(queen$artist)
[1] "Queen"
nrow(queen)
[1] 162
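
A side note on the factor surprise from a moment ago: the behaviour is easy to reproduce on a toy vector. This is only a sketch with invented values (the `artist` vector below is not the real column), showing that a subset of a factor keeps all its levels unless you drop them or convert to character:

```r
# Toy data: a factor with three levels
artist <- factor(c("Queen", "ABBA", "Queen", "Adele"))

# Subsetting a factor keeps all the original levels...
only_queen <- artist[artist == "Queen"]
print(table(only_queen))   # ABBA and Adele still show up with count 0

# ...unless we drop the unused levels or convert to character
print(table(droplevels(only_queen)))
print(table(as.character(only_queen)))
```

Converting to character, as done above, is the blunt option; `droplevels()` would be the choice if you wanted to keep working with factors.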

How about Beyoncé?

beyonce <- data.frame(lapply(beyonce, as.character), stringsAsFactors=FALSE)
unique(beyonce$artist)
[1] "beyonce-knowles"  "beyoncas-shakira" "beyonce"         
beyonce <- beyonce[beyonce$artist=="beyonce",]
unique(beyonce$artist)
[1] "beyonce"
nrow(beyonce)
[1] 118

For the final touch, let's get rid of columns we won't use and change the name of the columns so they will be identical in both subsets.

colnames(beyonce)[colnames(beyonce)=="lyrics"] <- "text"
beyonce$index <- NULL
beyonce$genre <- NULL
queen$link <- NULL

Now our subsets are ready to be analysed!

Part 2 - Analysis.

Text Mining

Is Freddie's music really more complicated and complex? Let's check. First of all, to operate on the data, we need to define a CORPUS. What is a corpus? Corpora are collections of documents containing (natural language) text (to know more, type ?Corpus in RStudio). We will create one with the Corpus function, passing the songs as a vector through the VectorSource function (to know more, type ?VectorSource in RStudio). Then we will prepare the Corpus for analysis: we will remove punctuation and numbers, strip extra whitespace, remove stop words, and then convert everything to lower case.

library(tm)
#create Corpus
tmq <- Corpus(VectorSource(queen$text))
#remove punctuation
tmq <- tm_map(tmq, removePunctuation)
#remove numbers
tmq <- tm_map(tmq, removeNumbers)
#strip white space
tmq <- tm_map(tmq, stripWhitespace)
#remove stopwords
# For a list of the stopwords, see:   
# length(stopwords("english"))   
# stopwords("english")   
tmq <- tm_map(tmq, removeWords, stopwords("english"))  
#convert to lower case
tmq <- tm_map(tmq, tolower)
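
One thing to watch in the steps above: the stop words in tm's English list are lower case and keep their apostrophes ("don't"), so when punctuation is stripped first and lower-casing comes last, words like "And" or "dont" can survive the cleaning. Below is a minimal sketch of an alternative order on two made-up lines. The `docs` vector and variable names are mine, and this is not the pipeline that produced the numbers in this post:

```r
library(tm)

# Two made-up lines, not real lyrics
docs <- c("And you don't love butterflies", "You and me")
corpus <- VCorpus(VectorSource(docs))

# 1. lower-case first, so "And" and "You" match the lower-case stop word list
corpus <- tm_map(corpus, content_transformer(tolower))
# 2. remove stop words while "don't" still has its apostrophe
corpus <- tm_map(corpus, removeWords, stopwords("english"))
# 3. only now strip punctuation, numbers and extra spaces
corpus <- tm_map(corpus, removePunctuation)
corpus <- tm_map(corpus, removeNumbers)
corpus <- tm_map(corpus, stripWhitespace)

dtm <- DocumentTermMatrix(corpus)
print(Terms(dtm))
```

If you rerun the analysis this way, expect the word counts to differ from the ones shown below.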

To proceed, create a document term matrix. This is what you will be using from this point on.

tmqmatrix <- DocumentTermMatrix(tmq)
inspect(tmqmatrix)
<<DocumentTermMatrix (documents: 162, terms: 3171)>>
Non-/sparse entries: 11008/502694
Sparsity           : 98%
Maximal term length: 21
Weighting          : term frequency (tf)
Sample             :
     Terms
Docs  and can dont get just love ooh time yeah you
  119   7   1   11   0    1    1   4    0    9   1
  14    6   0    4   1    1    0   0    6    0   1
  158   1   2    1   0    0    3   1    2    0   1
  21    3   1    2   0    2    8   7    2    4   3
  25    0   0    1   2   26    6   0    1    4   7
  33    1   3    1   0    4    0   0    1    0   0
  47    1   7    0   4    4   33   8    0    9   1
  81    1   3    1   0    0    4   0   11    4  13
  84    1   0   21   0    2    0   6   13    8   0
  98   11  12    0   0    0    0   3    1    8   4

What have we done? The above matrix (we can see only a small part of it) shows how many times each word (column header) was used in each document (row label on the left). Translation to human: the word CAN was used once in document 119, twice in document 158, three times in document 33, et cetera. This means that if we sum the columns, we will know how many times each word was used across all of the artist's songs.
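
To make the rows and columns concrete, here is a hand-made document term matrix in base R. It is a sketch on two invented "songs" (`d1` and `d2`), built without tm, just to show what the rows, the columns and the column sums mean:

```r
# Two invented documents
docs <- c(d1 = "love love you", d2 = "love time")

# Split into words and collect the vocabulary
tokens <- strsplit(docs, " ")
vocab <- sort(unique(unlist(tokens)))

# One row per document, one column per word, cells = number of uses
dtm <- t(sapply(tokens, function(w) table(factor(w, levels = vocab))))
print(dtm)

# Column sums = total uses of each word; number of columns = unique words
print(colSums(dtm))
print(ncol(dtm))
```

The real matrices below work the same way, only with thousands of columns.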

#tmb is the Beyoncé corpus, built from beyonce$text with the same steps as tmq
tmbmatrix <- DocumentTermMatrix(tmb)
freqb <- colSums(as.matrix(tmbmatrix))
head(sort(freqb, decreasing = TRUE))
love like dont baby  you  can 
 494  401  365  324  299  293 
tail(sort(freqb, decreasing = TRUE), 10)
moneydivas passengers      state  stilettos  strutting   daughter       road      sense     smiled       youd 
         1          1          1          1          1          1          1          1          1          1

tmqmatrix <- DocumentTermMatrix(tmq)
tmqmatrix
freqq <- colSums(as.matrix(tmqmatrix))
head(sort(freqq, decreasing = TRUE))
love yeah dont  ooh  and  you 
415  385  302  239  233  209 
tail(sort(freqq, decreasing = TRUE), 10)
butterflies     curtain      failed    mindless    overkill     painted   pantomime      spaces       tales      warmer 
          1           1           1           1           1           1           1           1           1           1 

Oh yes, love… Apparently the most used word by both of our artists.

Also, the number of columns tells us how many unique words each artist used. Let’s see…

ncol(tmqmatrix)
[1] 3171
ncol(tmbmatrix)
[1] 2575

Queen - 3171 words! Beyonce - 2575. Not good! Freddie knows 596 words more!
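
A rough sanity check, using only the numbers above: the Queen subset has 162 songs and the Beyoncé subset has 118, so part of the gap may simply come from having more text. A quick and admittedly crude way to put the two on the same scale is unique words per song (this ratio is a back-of-the-envelope measure of my own, not a standard metric):

```r
# Counts taken from the outputs above
unique_words <- c(Queen = 3171, Beyonce = 2575)
songs        <- c(Queen = 162,  Beyonce = 118)

# Unique words divided by number of songs, rounded to one decimal
per_song <- round(unique_words / songs, 1)
print(per_song)
```

Treat it as a hint rather than a verdict: vocabulary does not grow linearly with the number of songs, so this ratio has its own bias.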

Now that we know the numbers, let's talk about emotions. Is there a way to check whether the songs are positive or negative? Sure there is!

Sentiment analysis

There are a variety of dictionaries for evaluating the opinion or emotion in text. I've decided to use `sentiment140` by the GitHub user okugami79 because it worked best with my RStudio. As okugami79 writes on his GitHub page, the package is:

Easy to use, quick to run your own sentiment analysis of Twitter context free grammar
No additional installation of NLP components - it uses free sentiment140 service, they do vocabulary training, syntax of hash, http link etc.
No need for vocabulary building
Default language model is tuned for Twitter message, context free grammar language model
Supported language: English and Spanish

Yes, I know it says the package serves to analyze Twitter, but it also works for this example. Let's download it!

install.packages("devtools")
library("devtools")
install_github("okugami79/sentiment140")
library(sentiment)

Remember the Corpus we've created? As the text inside the Corpus is well prepared, we can use it for our sentiment analysis. How do we get the text out of a Corpus?

#nope
tmb$1
Error: unexpected numeric constant in "tmb$1"

#nope
tmb$`1`
NULL

#nope
tmb[1]
<<SimpleCorpus>>
Metadata:  corpus specific: 1, document level (indexed): 0
Content:  documents: 1

#nope
tmb[[1]]
<<PlainTextDocument>>
Metadata:  7
Content:  chars: 771

#hha! got it!
tmb[[1]]$content
[1] "I, I, I left no time to regret\nKept my (...)
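
For the record, tm also has accessor functions for this, so you don't have to poke around with `$`. Here is a small sketch on a made-up two-document corpus (the texts are placeholders; `content()` and `as.character()` are the accessors as far as I know, so check ?content if they behave differently in your version of tm):

```r
library(tm)

# Placeholder texts, not real lyrics
corpus <- VCorpus(VectorSource(c("first song text", "second song text")))

# Text of a single document
first <- content(corpus[[1]])
print(first)

# Text of every document as a plain character vector
all_texts <- unname(sapply(corpus, as.character))
print(all_texts)
```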

We can pass the text to the sentiment() function.

sentimentB <- sentiment(tmb$content)
str(sentimentB)
'data.frame':  118 obs. of  3 variables:
 $ text    : Factor w/ 91 levels " "," (Ay)(Ay)(Ay, Nobody likes being played)Oh, Beyonce, BeyonceOh, Shakira, Shakira (Hey)He said, I m worth it, his whim desireHe "| __truncated__,..: 91 36 58 43 81 46 68 34 60 9 ...
 $ polarity: chr  "negative" "negative" "positive" "positive" ...
 $ language: Factor w/ 2 levels "en","es": 1 1 1 1 1 1 1 1 1 1 ...

As you can see, `sentimentB` has a column named "polarity". That's it. For each song, an algorithm has decided whether it is positive, negative or neutral. How did it know that? It received a training dataset containing words that express feelings and was trained to recognize and classify them.
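
To get a feeling for what "classifying polarity" means, here is a deliberately naive sketch in base R. It is NOT how sentiment140 works (that one is a trained model behind a web service); it only counts words from two tiny word lists that I made up for the example:

```r
# Tiny hand-made word lists, purely illustrative
positive <- c("love", "happy", "good")
negative <- c("sad", "cry", "bad")

# Label a text by comparing counts of positive and negative words
polarity <- function(text) {
  clean <- tolower(gsub("[^[:alpha:] ]", "", text))
  words <- strsplit(clean, " +")[[1]]
  score <- sum(words %in% positive) - sum(words %in% negative)
  if (score > 0) "positive" else if (score < 0) "negative" else "neutral"
}

# Three invented lines
songs <- c("I love this good day", "So sad I cry", "Walking down the road")
labels <- vapply(songs, polarity, character(1), USE.NAMES = FALSE)
print(table(labels))
```

A trained classifier learns which words (and combinations) matter from labelled examples instead of relying on fixed lists, but the output has the same shape: one label per text.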

table(sentimentB$polarity)

negative  neutral positive 
      38       12       68

Wanna see the percents? Here:

prop.table(table(sentimentB$polarity))*100

negative  neutral positive 
32.20339 10.16949 57.62712

Same operation for Queen’s songs:

sentimentQ <- sentiment(tmq$content)
table(sentimentQ$polarity)

negative  neutral positive 
     244       44      125 
prop.table(table(sentimentQ$polarity))*100

negative  neutral positive 
59.07990 10.65375 30.26634 

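Ok, so Freddie tended to sing about sad things. (Careful, though: the counts above add up to 413, the number of rows the "queen" grep matched before filtering, not the 162 Queen songs, so these percentages seem to include the other "Queen..." artists as well.) Are we done with the analysis? Nope! We have to present our results to the world in a proper form. Therefore, we go to.... the Visualization Part.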

Part 3 - Visualization.

Let’s gather all the information we have:

what	Queen	Beyoncé
songs	162	118
most used words	love yeah dont ooh and you	love like dont baby you can
least used words	butterflies curtain failed mindless overkill painted pantomime spaces tales warmer	moneydivas passengers state stilettos strutting daughter road sense smiled youd
how many different words	3171	2575
sentiment	negative:59.07990 neutral:10.65375 positive:30.26634	negative:32.20339 neutral:10.16949 positive:57.62712

Create piecharts, barplots and anything that would show numbers in a nice way:

I won’t explain the creation of plots here, but I will show you the code, so you can use it:

Piecharts to show the mood:

#percentages per mood, in the order negative, neutral, positive
bey_pct <- as.numeric(prop.table(table(sentimentB$polarity))*100)
#use a new name so the queen data frame is not overwritten
queen_pct <- as.numeric(prop.table(table(sentimentQ$polarity))*100)
mood <- c("negative","neutral","positive")
#name the columns directly so the values cannot end up under the wrong artist
piechart <- data.frame(Queen=queen_pct, Beyonce=bey_pct, mood=mood)
piechart
     Queen  Beyonce     mood
1 59.07990 32.20339 negative
2 10.65375 10.16949  neutral
3 30.26634 57.62712 positive

library(plotly)
cols <- c('rgb(141,160,203)', 'rgb(102,194,205)', 'rgb(102,194,230)' )

plot_ly(piechart, labels = ~mood, values = ~Beyonce, type = 'pie',
             textposition = 'inside',
             textinfo = 'label+percent',
             insidetextfont = list(color = '#FFFFFF'),
             marker = list(colors = cols,
                      line = list(color = '#FFFFFF', width = 1)),
             showlegend = FALSE) %>%
  layout(title = 'Mood in the songs',
         xaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE),
         yaxis = list(showgrid = FALSE, zeroline = FALSE, showticklabels = FALSE))

Number of words per artist:

plot_ly(
  x = c("Queen", "Beyonce"),
  y = c(3171, 2575),
  showlegend = FALSE,
  name = "number of unique words per artist",
  type = "bar",
  marker = list(color = c('rgb(141,160,203)', 'rgb(102,194,205)'))) %>%
  layout(title='number of unique words per artist')

Most used words:

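library(ggplot2)
#tip: wrap this in head(..., 20) to plot only the top words instead of all of them
freq <- sort(freqq, decreasing = TRUE)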
wf <- data.frame(word=names(freq), freq=freq)
ggplot(wf, aes(x = reorder(word, -freq), y = freq)) +
  geom_bar(stat = "identity", fill="#8da0cb") + 
  theme(axis.text.x=element_text(angle=45, hjust=1),
        panel.background = element_blank(),
        axis.title.x=element_blank(),
        axis.title.y=element_blank(),
        plot.title = element_text(hjust = 0.5))+
  ggtitle("Queen's most used words")

Create a wordcloud

To show the most used words we can also use something called a WORDCLOUD. Yes, it is a cloud made of words, good thinking! It is super easy, because we have a special library that will do it for us, and we already have everything we need. But for those who are here only to learn how to create a wordcloud, here is a recap of the previous steps.

Prerequisites:

setwd("C:\\your_directory\\song_lyrics")
lyrics <- read.csv(file="lyrics.csv")
beyonce <- lyrics[grep("beyonc", lyrics$artist, ignore.case = TRUE),]
beyonce <- beyonce[beyonce$artist=="beyonce",]

colnames(beyonce)[colnames(beyonce)=="lyrics"] <- "text"

library(tm)

tmb <- Corpus(VectorSource(beyonce$text))
#remove punctuation
tmb <- tm_map(tmb, removePunctuation)
#remove numbers
tmb <- tm_map(tmb, removeNumbers)
#strip white space
tmb <- tm_map(tmb, stripWhitespace)
#remove stopwords
# For a list of the stopwords, see:   
# length(stopwords("english"))   
# stopwords("english")   
tmb <- tm_map(tmb, removeWords, stopwords("english"))  
#convert to lower case
tmb <- tm_map(tmb, tolower) 
tmbmatrix <- DocumentTermMatrix(tmb)
freqb <- colSums(as.matrix(tmbmatrix))

Wordcloud magic:

library(wordcloud)
set.seed(142)
#words and frequencies must come from the same artist (freqb, not freqq)
wordcloud(names(freqb), freqb, min.freq=25)

Is it informative, nice-looking, readable? No! We need to define some parameters. First of all, we don't need EVERY single word. Let's limit the number to 100. Second, we need some color (for color palettes, check the brewer page).

#six shades from the "Blues" palette (brewer.pal comes from RColorBrewer)
blues <- brewer.pal(6, "Blues")
wordcloud(names(freqb), freqb, max.words=100, rot.per=0.2, colors=blues)

Last but not least, all information gathered on one infographic.

Written on July 13, 2017