
Natural Language Processing in R: Duquesne Economics Senior Theses

The purpose of this post is to give an example of R’s natural language processing packages tm and qdap. To do this we will analyze the winning papers from each year of the Duquesne Economics Senior Thesis. Duquesne economics majors take part in a one-semester self-directed project in which each student picks a topic, conducts research, and then presents their findings in front of a panel of faculty and experts from outside the university. Each year one paper is selected as the best research of that year. Doctor Antony Davies makes these available here. We will examine each year’s winning senior thesis in order to identify common themes between and within papers.

Turning PDFs Into Text Documents

In order to examine the theses we will convert each PDF into a text document. To do this we will use pdftotext, available here, along with code from Ben Marwick available here. The PDFs and converted text files are available in a zip file here. Once you have downloaded the files, we point R to the directory that holds the PDFs, use list.files() to build a character vector of the file names in that directory, and then use sapply() to loop over each file name and strip out white space. Finally, we call list.files() again to grab the now whitespace-free file names and pass each one to pdftotext.exe.

# specifying '.' in our directory means "starting from the current working directory" which I assume is one folder above where the files are located.
dest <- "./thesis/"
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# remove white space from the PDF file names
sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})

# get the PDF file names without spaces
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# run pdftotext.exe over each PDF to convert it to a text file
lapply(myfiles, function(i) system(paste("./xpdfbin-win-3.04/xpdfbin-win-3.04/bin64/pdftotext.exe", paste0('"', i, '"')), wait = FALSE) )

If you look in the folder that contains your PDF files, you will now also find a text file for each of the original senior theses.

Creating a Corpus

Now that we have all of our files in text format, we will use tm to create a corpus of the documents. A corpus is a structured collection of texts that keeps our documents organized while we perform the analysis. After loading tm and qdap (we will need qdap later, so why not load it now?), we use the Corpus() function to create the corpus for our analysis.

library(tm)
library(qdap)

dest <- "./thesis"

# create corpus
docs <- Corpus(DirSource(dest,pattern="txt"))
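As a quick, optional sanity check (a sketch; output not shown here), you can confirm the corpus picked up all nine text files:

# quick checks on the corpus
length(docs)          # nine theses should give nine documents
# inspect(docs[1])    # peek at the first document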

Because a later step requires us to break each document up by sentence, we will remove all numbers; many of them are decimals, and qdap gets confused by periods that do not mark the end of a sentence. We will also remove what are known as “stop words”. Stop words are words that show up very frequently but carry little meaning on their own. For instance, “the” and “and” are stop words because they appear constantly in English text without telling us anything about the topic. If we did not remove them, we might conclude that two authors who both happen to use the word “but” frequently are extremely similar, when really they just both lean on a common function word. We will also convert every capital letter to lowercase so that R treats “Heteroskedasticity” and “heteroskedasticity” as the same word.

# remove numbers
docs <- tm_map(docs, removeNumbers)

# remove stop words
docs <- tm_map(docs, removeWords, stopwords("SMART"))

# make all letters lower case
docs <- tm_map(docs, content_transformer(tolower))
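To see roughly what stop word removal and lowercasing do, here is a toy example on a made-up sentence (not from the theses):

# Toy example: lowercase a made-up sentence, then strip the SMART stop words;
# removeWords() leaves blank gaps where the stop words used to be
toy <- "The Model and the Data are Significant"
removeWords(tolower(toy), stopwords("SMART"))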

Cleaning Documents with qdap

In order to do some extra cleaning we will use qdap, converting the corpus to a data frame first. We need to do this because most files contain a title page full of boilerplate text such as “Division of Economics”, “AJ Palumbo School of Business Administration”, “Duquesne University”, “Pittsburgh, Pennsylvania”, and “december”. We can strip these out with the mgsub() function, which searches for the strings we specify and replaces each match with whatever we want (in this case just a blank space). We will also remove escaped characters with clean(), clear miscellaneous anomalies with Trim(), and expand abbreviations such as Dr. and Jr. to Doctor and Junior with replace_abbreviation(). Finally, we will use mgsub() to remove abbreviated middle initials such as “K.” and “A.”, because once we split the text into sentences qdap would treat those periods as sentence endings where they are not.
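If mgsub() is unfamiliar, here is roughly what it does on a made-up string (a toy sketch; qdap is loaded above):

# Toy example: every listed pattern is replaced with a blank space
mgsub(c("division of economics", "duquesne university"), " ",
      "division of economics duquesne university december regression results")
# roughly just "december regression results" should remain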

# convert corpus to data frame
qdocs <- as.data.frame(docs)

# put in author names
qdocs$docs <- c("Brady","Curcio","France","Gangewere","Horne","Kandrack","Valchev","Vicinie","Whitaker")

# remove misc anomalies and white spaces
qdocs$text <- clean(Trim(qdocs$text))

# replace symbols with words, e.g. % becomes 'percent'
qdocs$text <- replace_symbol(qdocs$text)

# replace "ph.d." with Doctor (fixed = TRUE so the periods are matched literally)
qdocs$text <- gsub("ph.d.", "Doctor", qdocs$text, fixed = TRUE)


# replace other abbreviations
qdocs$text <- replace_abbreviation(qdocs$text)

# remove all abbreviated middle names
qdocs$text <- mgsub(paste0(" ",letters,".")," ",qdocs$text)

# remove other custom stop words and garbage left from tables
qdocs$text <- mgsub(c("division ", " economics "," aj "," palumbo "," school "," business "," administration "," duquesne "
                      ," university "," pittsburgh, ", " pennsylvania "," december "," submitted ", " economics "," faculty "
                      ," partial ", " fulfillment ", " requirements ", " degree "," bachelor "," science "," business ",
                      " administration "," BSBA "," mcanulty ","department ", " college ","liberal arts ",
                      "..","*","**","***","()","( )","-","(",")",",,","(.)",",,","-.*",". .",", ,","|","+","_","-","=","["
                      ,"]",". ,",".:",":."," . ","<.","` ",", .",".,",",t ."," ., ",". .","^","'")," ",qdocs$text)

Analyzing Words and Sentences

Pretty Graphs of Words and Standard Deviations

Now that we have cleaned up our text files we can start making some pretty graphs! At this point we will split each document up by sentence using the sentSplit() function. Once the sentences are split we will plot them using qdap’s built-in plotting method for sentence data frames. You may get a warning message stating “non character, missing end punctuation, etc.” This warning can come from bits of the tables that still remain or from not scrubbing out all of the junk (or scrubbing too much!). We’ll ignore it for now; looking back at any file in qdocs shows that the general cleaning is pretty good.

sent.docs <- sentSplit(qdocs,"text")
## Warning in sentSplit(qdocs, "text"): The following problems were detected:
## non character, missing ending punctuation, non space after comma, non ascii, containing escaped, indicating incomplete
## 
## *Consider running `check_text`
plot(sent.docs, grouping.var = "docs")

This plot shows that senior theses normally contain about 3800 words, with a maximum of about 5000. Curcio, Vicinie, and Whitaker all won the senior thesis with only about 3000 words! My good friend Bill Gangewere and Kandrack both sit right near the median word count. The large spread in word count tells us that short and long papers both have a good chance at winning. However, most theses do sit around the 3800 word mark, so, as theses go, short is sweet.

We can make this graph a bit more illustrative by examining how each sentence’s word count deviates from its author’s average. To do this, we will create two new columns called scaled and ID. Reading from the inside out: wc() counts the words in each sentence; tapply() applies scale() within each author, standardizing the counts (subtracting the author’s mean and dividing by the standard deviation); and unlist() puts the standardized counts back into a vector that is attached to sent.docs. The IDs are created similarly, using seq_along() within each author to number the sentences.

sent.docs$scaled <- unlist(tapply(wc(sent.docs$text), sent.docs$docs, scale))
sent.docs$ID <- factor(unlist(tapply(sent.docs$docs, sent.docs$docs, seq_along)))
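If the nesting of functions is hard to follow, here is the same idea on a toy vector of made-up word counts:

# Toy example: standardize counts within each (pretend) author
counts <- c(10, 20, 30, 5, 15)          # word counts per sentence
author <- c("A", "A", "A", "B", "B")    # which author each sentence belongs to
unlist(tapply(counts, author, scale))   # z-scores computed within each author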

Now we use a big ol’ ggplot() statement to make a pretty graph of the standardized word count of each sentence.

## Plot with ggplot2
library(ggplot2); library(grid)

 ggplot(sent.docs, aes(x = ID, y = scaled, fill = docs)) +
  geom_bar(stat ="identity", position ="identity",size = 5) +
  facet_grid(.~docs) + 
  ylab("Standard Deviations") + xlab("Sentences") +
  guides(fill = guide_legend(nrow = 1, byrow = TRUE,title="Author"))+ 
  theme(legend.position="bottom",axis.text = element_blank(), axis.ticks = element_blank()) +
  ggtitle("Standardized Word Counts\nPer Sentence")+coord_flip()

We can see that our good friend Bill Gangewere and Kandrack have the most consistent word count per sentence. Horne, despite having the longest thesis, keeps a pretty consistent word count per sentence aside from a stretch in the middle where things get a bit lengthy. Looking at this graph, we can say that most theses maintain a fairly consistent word count per sentence, with one or two places that require a longer explanation.

Using tm to Inspect Words

At this point we will go back to our docs corpus so we can start using tm for the analysis. There is functionality to turn qdap objects into tm objects, but I found it complicated to get qdap to do what I wanted. We will remove punctuation, our custom stop words, and extra white space. This time we will also stem the words. Stemming reduces each word to its root, so “stopping”, for example, becomes “stop”.

library(SnowballC)

docs <- tm_map(docs, removePunctuation)

docs <- tm_map(docs, removeWords,c("division ", " economics "," aj "," palumbo "," school "," business "," administration "," duquesne "," university "," pittsburgh, ", " pennsylvania "," december "," submitted ", " economics "," faculty "," partial ", " fulfillment ", " requirements ", " degree "," bachelor "," science "," business "," administration "," BSBA "," mcanulty ","department ", " college ","liberal arts","()","()","( )","-","(",")",",,","(.)",",,","-.*",". .",".:",":.","[","]","<.","` ",", .",".,",",t ."," ., ",". .","^","'","this","the"))

docs <- tm_map(docs, removeWords,c("the","then","this","data", "econom",  "estim",
                                   "expect", "find",  "inform", "level", "measur",
                                   "model", "perform", "research" , "result", "signif",
                                   "studi", "tabl", "variabl","appendix"))

docs <- tm_map(docs, stripWhitespace)

docs <- tm_map(docs, stemDocument)
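To see what the stemmer actually produces, here is a quick check on a few hand-picked words using SnowballC’s wordStem(), which stemDocument() relies on:

# Toy example: stem a few words directly (SnowballC is loaded above)
wordStem(c("stopping", "estimated", "economics", "variables"))
# should return roughly: "stop" "estim" "econom" "variabl"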

Now that we have done some cleaning we will create a document term matrix. A document term matrix (DTM) has one row per document and one column per unique term. Each cell holds the number of times that term appears in that document; if the term never appears, the cell holds zero. You may have also heard of sparse matrices. A DTM is normally sparse because most terms appear in only a handful of the documents, so most cells are zero. Each term is also assigned a weight in our DTM; here the weighting is simple term frequency, i.e. how often the term is used in each document.

dtm <- DocumentTermMatrix(docs)
rownames(dtm) <- c("Brady","Curcio","France","Gangewere","Horne","Kandrack","Valchev","Vicinie","Whitaker")
dtm
## <<DocumentTermMatrix (documents: 9, terms: 4402)>>
## Non-/sparse entries: 9138/30480
## Sparsity           : 77%
## Maximal term length: 82
## Weighting          : term frequency (tf)
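To make the structure concrete, here is the DTM of two made-up toy sentences (not from the theses):

# Toy example: two tiny documents and their document term matrix
toy <- Corpus(VectorSource(c("growth affects state growth",
                             "state growth matters")))
inspect(DocumentTermMatrix(toy))
# 'growth' should get a count of 2 in the first document and 1 in the second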

Now that we have a DTM we can use colSums() to find the most frequently used words. We then order the sums and pull entries off the tail end (because order() sorts from least to most) to get the fifteen most frequent words.

freq <- colSums(as.matrix(dtm))


ord <- order(freq)


# most frequent terms
freq[tail(ord,n=15)]
Most Frequently Used Words: measur (187), risk (189), result (207), state (217), govern (220), studi (226), breach (234), signific (238), bank (260), expect (261), effect (268), growth (284), econom (311)

Looking at this you would think most theses were about banks, breaches, and risk. However, while these terms show up quite a lot, they most likely come from a single document; in other words, these terms are sparse. Before we do any further analysis, notice above that our matrix is rather sparse, with about 77% of its entries being zero! We can use tm’s removeSparseTerms() function to drop terms that are missing from more than 30% of the documents, i.e. keep only terms that appear in roughly 70% or more of the theses.

# remove sparse terms
dtms <- removeSparseTerms(dtm, 0.3)
dtms
## <<DocumentTermMatrix (documents: 9, terms: 259)>>
## Non-/sparse entries: 2051/280
## Sparsity           : 12%
## Maximal term length: 16
## Weighting          : term frequency (tf)
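One thing that is easy to misread: the second argument to removeSparseTerms() is the maximum share of documents a term may be missing from, not a frequency cutoff. A rough illustration with our nine theses:

# sparse = maximum allowed share of documents that do NOT contain a term
removeSparseTerms(dtm, 0.3)   # keeps terms present in roughly 7 or more of the 9 theses
removeSparseTerms(dtm, 0.8)   # would keep terms present in at least 2 of the 9 theses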

Now, with some of the sparsity removed, we can examine things like correlations between words and words that have a high frequency. Using findFreqTerms() we will examine our new, less sparse matrix and pull out the terms used at least 15 times. After we find that the word ‘state’ is used frequently we can use findAssocs() to find words that have a correlation of at least 0.7 with ‘state’. After that, we can plot a pretty graph of the most frequent terms (those used at least 100 times) and restrict the graph to only draw edges between terms with a correlation above 0.6.

# find the terms with the highest frequency
# (lowfreq = 15 chosen so only the most frequent terms are returned)
findFreqTerms(dtms, lowfreq=15)

# find words with high correlation to state
findAssocs(dtms,"state", corlimit=0.7)

# plot terms used at least 100 times, drawing edges where correlation exceeds 0.6
plot(dtms,terms =findFreqTerms(dtms, lowfreq=100),corThreshold=0.6)

Most Frequently Used Words: econom, effect, estim, expect, growth, individu, inform

Words Associated with “state”: size (0.97), appli (0.84), growth (0.84), compon (0.83), vari (0.83), unit (0.82), there (0.81), econom (0.80), interest (0.79), analysi (0.78), general (0.72)

The list of frequent terms was truncated to save space. With the most frequently used words being the stems of words like growth, effect, individual, and time, we can say that most winning theses study the growth and effects of economic variables and data on individuals over time. State tends to be a grouping variable, and the size of states also seems to matter. Examining the plot, the stem regress is closely linked to error, observation, significance, and sample, which hints that most winning theses are regression based. Test and error are also connected, so a winning senior thesis may perform thorough checks on the error terms of its regression. Most analyses span states and time.

The last thing we will do is examine the number of letters in each word, by senior thesis. To do this we will use the R package reshape2 to turn our wide document term matrix into long format. When we melt a matrix into long format, each cell becomes its own row with three columns: the document (the row name), the term (the column name), and that term’s count in that document. We then use the column named Docs as our faceting variable and make the pretty graph below.

library(reshape2)

dtm.mat <- as.matrix(dtm)

dtm.melt <- melt(dtm.mat)

dtm.melt <- as.data.frame(dtm.melt)

knitr::kable(head(dtm.melt,n=10))
Docs       Terms     value
Brady      abagail       0
Curcio     abagail       0
France     abagail       0
Gangewere  abagail       1
Horne      abagail       0
Kandrack   abagail       0
Valchev    abagail       0
Vicinie    abagail       0
Whitaker   abagail       0
Brady      abbrevi       0
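If the reshaping is hard to picture, here is the same operation on a tiny made-up matrix:

library(reshape2)

# Toy example: melting a 2x2 matrix with named dimnames gives one row per cell
m <- matrix(c(1, 0, 2, 3), nrow = 2,
            dimnames = list(Docs = c("A", "B"), Terms = c("cat", "dog")))
melt(m)
# each row is a (Docs, Terms, value) triple: A/cat/1, B/cat/0, A/dog/2, B/dog/3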
# tabulate how many letters each term has
term.table <- table(nchar(as.character(dtm.melt$Terms)))

# find the most common word length (the mode)
term.mode <- as.numeric(names(term.table[which(term.table==max(term.table))]))

# condition used to highlight the mode in the plot
cond <- nchar(as.character(dtm.melt$Terms))*dtm.melt$value == term.mode

#make a pretty graph
ggplot(dtm.melt) + geom_histogram(data=subset(dtm.melt,cond==FALSE),binwidth=1,aes(x=nchar(as.character(Terms))*value,fill=Docs))+
  facet_grid(Docs~.) + geom_histogram(data=subset(dtm.melt,cond==TRUE),binwidth=1,aes(x=nchar(as.character(Terms))*value),title="Mode",fill="red")+
  labs(x="Number of Letters",y="Number of Words") + xlim(1,20)  +
  guides(fill = guide_legend(nrow = 9, byrow = TRUE ,title="Author"))+ 
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
  ggtitle("Length of each word \n by Author")+
    geom_text(data=data.frame(x=6.5, y=30, label="Mode", stat=c("ta")),aes(x,y,label=label),size=3, inherit.aes=TRUE)

The mode, the most frequent number of letters per word, is six, so most words tend to be relatively short. This graph alludes to the idea that the best theses are ‘short and sweet’. The longer words are most likely statistical terms like “heteroskedasticity” and “endogeneity”. Taking everything into account, we can say that a winning senior thesis can be long or short, will be consistent in the number of words per sentence, will favor short words over longer ones, and will tackle a topic that looks at economic variables over U.S. states and time. So maybe next year’s winning thesis will be “Measuring the Expected Size of Economic Growth by State”.

I’m satisfied with R’s NLP packages and think R will continue to be one of the go-to tools for this type of analysis. One of R’s main benefits, which shines in NLP, is how easy it is to make pretty graphics. When working with large documents, summarizing the data in a graph lets researchers explore it in a concise manner.