
Understanding rCharts

Why use rCharts?

Interactive visualization is the new ‘hot take’ for data people. ggvis and rCharts have become the two easiest ways to make these graphs, but there is a clear distinction to draw between them. rCharts is to ggvis what R is to STATA: they accomplish the same task, but one has a low learning curve and makes very simple things easy, while the other has a high learning curve and is built to take on custom projects. When deciding which to use, the drastic difference in learning curves should only matter if you have a project due immediately; otherwise, focus on which one makes your life easier in the long run and leaves room to grow. Throughout this post it will become clear which one that is. The purpose of this post is to demonstrate how rCharts works, explain how to find the documentation you need, and show how to export projects to be viewed on the web, either as a direct link or as an embedded iframe.

Setting up rCharts and Examples

I would recommend checking out two places to get set up and see examples:

  1. The rCharts home page to set up rCharts with devtools

http://ramnathv.github.io/rCharts/

  2. The gallery of rCharts (no longer broken!) has tons of awesome examples and code for making graphs.

http://rcharts.io/gallery/

Look at this gorgeous example of pyramid (histogram) charts in rCharts!

http://walkerke.github.io/2014/06/rcharts-pyramids/

We will use Kyle Walker’s code for this project to go through how rCharts works.

Documentation Woes

You may have noticed that rCharts does not have any documentation of its own. This is true. However, we’ll quickly see that this may be because separate documentation is not really necessary. Here’s one detail that people tend to hand-wave through but that deserves some stress: ggvis is built on Vega, one particular JavaScript library full of helper functions for writing d3 code. ggvis and rCharts both translate your R code into a helper library such as Vega, which then translates everything into d3/JavaScript. This is why rCharts and ggvis need a browser to run!

rCharts can use Vega, along with about nine other libraries for writing d3 code. This makes it much more flexible as you can see from the examples on the rCharts home page. With access to multiple libraries you can use the library that is best suited for the graphic you want to make.
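
As a quick illustration of that flexibility, the same kind of one-liner can target different libraries just by switching the plotting function. This is only a minimal sketch using the built-in mtcars data; hPlot() and nPlot() are the Highcharts and NVD3 plotting functions shown in the rCharts examples, so double check the calls against your installed version.

library(rCharts)

# same data, two different javascript libraries doing the drawing
h <- hPlot(x = 'wt', y = 'mpg', data = mtcars, type = 'scatter')  # Highcharts
n <- nPlot(mpg ~ wt, data = mtcars, type = 'scatterChart')        # NVD3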

Keep this process in mind throughout the pyramid chart example from above.

R -> ‘Some helper library’ -> D3

The code is simple once it is broken down into pieces.

Code Breakdown

The call to hPlot() sets up the plot for us and is the bread and butter of making rCharts; we store the result in an object called h1. hPlot() uses the Highcharts library to build your interactive visualization, just as ggvis uses Vega.

  h1 <- hPlot(
    y = 'Population', 
    x = 'Age', 
    type = 'bar', 
    data = dat.melt,
    group = 'Gender')

Alright so hey, that’s simple! We tell R what our x and y variables are, what type of graph we want, where the data is, and what variable we use to group the data. Let’s look at how weird the next line looks, and maybe I will blow your mind for a second.

  h1$plotOptions(series = list(stacking = 'normal', pointPadding = 0, borderWidth = 0))

What is this glob of dollar signs and lists? The dollar sign is used to access variables of what is known as a reference class object, much like how we grab our coefficients with mylm$coefficients. It’s pretty similar to a Python class, if that is an easier way to think about it. You can see the reference class for hPlot() here.
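
For comparison, here is the same dollar-sign pattern on a plain lm() fit (base R, safe to run anywhere); the only twist with rCharts is that the things behind the $ are methods of the reference class rather than stored values:

# grabbing a stored value from a fitted model with $
mylm <- lm(mpg ~ wt, data = mtcars)
mylm$coefficients

# on an rCharts object the $ instead calls a method that sets chart options
class(h1)   # a reference class object built by hPlot()
# h1$plotOptions(...), h1$tooltip(...), etc. are all methods of that class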

So where and how the heck do we find out what ‘series’, ‘stacking’, etc. mean? Check out the Highcharts API. This is the documentation for the Highcharts library that the R code is being translated into. Remember when we set the type of graph to ‘bar’? Look at the link:

http://api.highcharts.com/highcharts#series<bar>

Look at that. All of our variables are documented inside of Highcharts for changing the series in a bar chart. So in a way, rCharts has tons of documentation; it’s just written in the underlying libraries. We can look through this and see what stacking, pointPadding, series, etc. stand for. series takes a list because it has its own parameters to assign: whenever you see a parameter in Highcharts that has its own sub-parameters, you place those sub-parameters in a list. The reason rCharts has so little documentation of its own is that it would be silly to write all of this twice, once in rCharts and again in the actual library. Since rCharts only translates your R code to the respective library, and then to d3, we can just look in that library’s docs for our parameters and variables.
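
To make the list-nesting concrete, here is one more call of my own (an illustration, not part of Kyle Walker’s script): in the Highcharts docs, xAxis.title is itself an object with its own options, so in R it becomes a named list inside the xAxis() call.

# Highcharts: xAxis: { title: { text: 'Age' } }
# rCharts:    a sub-parameter with its own parameters becomes a nested list
h1$xAxis(title = list(text = 'Age'))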

The next line shows where this style of thinking in terms of translation shines. Here we are going to make a tooltip (the box that shows up when you hover over the chart). The trick is that we are going to write the tooltip in JavaScript, inside R.

h1$tooltip(formatter = "#! function() { return '<b>'+ this.series.name +', age '+ this.point.category +'</b><br/>' + 'Population: ' + Highcharts.numberFormat(Math.abs(this.point.y), 0);} !#")

With just a bit of R (calling the correct method on the reference class) and some JavaScript, we’ve taken one of the more difficult and impressive things in d3 and made it pretty simple. We create an anonymous function ( function(){ ) and have it return what we would like to see in the tooltip. The whole function is wrapped in #! ... !# so rCharts knows to treat this code as JavaScript. We can even throw in some HTML to get the nice formatting we want.

With some of the knowledge you have now, find the next line in the Highcharts API and figure out what reversed does.

h1$legend(reversed = "true")

With the knowledge of how we use the reference class to access variables and parameters in Highcharts, look at how we change the y axis labels with a javascript function. I leave it to the reader as an exercise to look into the Highcharts API and see what the parameters in yAxis do.

  if (max(dat.melt$Population) >= 1000000) {
    h1$yAxis(labels = list(formatter = "#! function() { return (Math.abs(this.value) / 1000000) + 'M';} !#"), 
             title = list(enabled = TRUE, text = 'Population'))
  } else {
    h1$yAxis(labels = list(formatter = "#! function() { return (Math.abs(this.value) / 1000) + 'K';} !#"), 
             title = list(enabled = TRUE, text = 'Population'))
  }
  
  # Colors is set earlier in his code
  if (!is.null(colors)) {
    h1$colors(colors)
  }

We want people to be able to download the graph as a PDF, SVG, etc., so we set exporting to TRUE.

  h1$exporting(enabled = TRUE)

To show the graphic we call the object and a browser should open up.

  h1

And that’s it! Now let’s go over how to show people your work.

Showing Off Your Work

The whole point of making pretty graphs is so other people can see them. One of my biggest complaints about shiny and ggvis is that there are only three ways to present a shiny app / ggvis presentation.

  1. They have R installed and can run your script (lol)
  2. Using shinyapps.io and sending them the link
  3. Having your own shiny server

One is almost never feasible in a business setting and is a pretty tedious process just to view a plot. Two and three are both viable; however, the free version of shinyapps limits how long people are allowed to view the plot, so at some point you will have to fork over cash, which I would rather not do.

Let’s look at two methods for sharing an rCharts graphic: either as an HTML page you can send over email, or as a link on a gist.

Sending an HTML Page

Let’s make a standard rChart from the rCharts page on sharing.

library(rCharts)
r1 <- rPlot(mpg ~ wt, data = mtcars, type = 'point')
r1$show('iframesrc', cdn = TRUE)

Notice in this example the plot goes directly on the page by calling $show(). This post is written in knitr, which allows R code to be written into an HTML page. The graph is translated into an iframe, which is then placed on the page. To place the graph on a webpage that already exists, we call the $save() function and use a gist. A gist is a sort of ‘one-off’ version of GitHub: it’s made for sharing code and files quickly and easily, which is perfect for our purpose. So first let’s save the plot.

r1$set(width = 400, height = 350)
r1$save('C:/Users/brond_000/Documents/Blog/rChartsBreakdown/mychart2.html', standalone = TRUE)

Here we save the plot to a specific folder and set standalone equal to TRUE so the d3 and Highcharts libraries will be saved inside the HTML page the save function generates. Once you save the chart, look in the folder and you should see an HTML page titled “mychart2.html”. Clicking on it will open up the plot! So now we can send it to someone as an email attachment. But how do we send them a link? Go to gist and RawGit. When you open up gist you will see a box for code to go into. With the folder open in Explorer, drag mychart2 from the folder onto the box and let go. This will take the code for your plot and place it into the gist box. Now click either create public or secret gist and your code will be stored online! From there, click on raw on the right side to go to a page with just your code on it, with an address that starts with something like “gist.githubusercontent.com/…”.

Now go to RawGit. Inside the main bar, copy/paste the address of your raw gist. RawGit will generate two links, a development link and a production link. While messing around or just sharing with a friend you can use the development link, but once everything is finalized I would encourage using the production link. Take the development link, paste it into your address bar, and take a look at your graph!

Now you can place this graph inside of a webpage with an iframe.

<iframe src="http://cdn.rawgit.com/Stevo15025/c8e37aa1dae84791cdad/raw/0577777b74b980e3cacd66251187650a0c557d8b/mychart2.html" width="900" height="500" seamless frameborder="0"></iframe>
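
As an aside, the rCharts page on sharing also describes a publish() method that pushes a chart to a gist straight from R, which skips the drag-and-drop step. I am writing this from memory, so treat the exact arguments as an assumption and check the sharing page before relying on it:

# hedged sketch: publish the chart to a gist directly from R
# (asks for GitHub credentials; argument names may differ by rCharts version)
r1$publish('mychart2', host = 'gist')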

And there you have it! Now you can send rCharts to all of your friends and family with ease! In this post we covered how to find variables and parameters for rCharts from a D3 library’s API and how to share visualizations built in rCharts with friends and colleagues. Notice that our embedded iframe cuts off a little bit of the top and right side of our graph. I’ll be exploring why this happens, but for now I do not see this as an issue. If you have any questions feel free to email or leave a comment.

Natural Language Processing in R: Duquesne Economic Senior Theses

The purpose of this post is to give an example of R’s natural language processing packages tm and qdap. To do this we will analyze the winning papers from each year of the Duquesne Economics Senior Thesis. Duquesne economics majors take part in a one-semester self-directed project in which each student picks a topic, conducts research, and then presents their findings in front of a panel of faculty and experts from outside the university. Each year a winner is selected and given the honor of being the best research of that year. Doctor Antony Davies makes these available here. We will examine each year’s winning senior thesis in order to determine common themes between and within papers.

Turning PDFs Into Text Documents

In order to examine the theses we will convert each PDF into a text document. To do this we will use pdftotext, available here, and code from Ben Marwick, available here. The PDFs and converted text files are available in a zip file here. Once you have downloaded the files, we specify where the files are and where they will end up by pointing R to the directory that holds the PDFs, using list.files() to create a character vector of the file names within the directory, and then using sapply() to loop over each file name and remove white space. Then we re-run list.files() to grab the now whitespace-free file names and pass them to pdftotext.exe.

# specifying '.' in our directory means "starting from the current working directory" which I assume is one folder above where the files are located.
dest <- "./thesis/"
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# remove whitespace
sapply(myfiles, FUN = function(i){
  file.rename(from = i, to =  paste0(dirname(i), "/", gsub(" ", "", basename(i))))
})

# get the PDF file names without spaces
myfiles <- list.files(path = dest, pattern = "pdf",  full.names = TRUE)

# apply pdftotext over each file to convert each PDF to a text file
lapply(myfiles, function(i) system(paste("./xpdfbin-win-3.04/xpdfbin-win-3.04/bin64/pdftotext.exe", paste0('"', i, '"')), wait = FALSE) )

If you look in the folder that contains your PDF files you will now also have text files for each of the original senior theses.

Creating a Corpus

Now that we have all of our files in text format, we will use tm to create a corpus of the documents. A corpus is a structured collection of texts that keeps our documents organized while we perform the analysis. After loading tm and qdap (we will need it later, so why not load it now?) we will use the Corpus() function to create the corpus for our analysis.

library(tm)
library(qdap)

dest <- "./thesis"

# create corpus
docs <- Corpus(DirSource(dest,pattern="txt"))

Because a later step requires us to break each document up by sentence, we will remove all numbers: many of them are decimals, and qdap becomes confused by periods that do not mark the end of a sentence. We will also remove what are known as “stop words”. Stop words are words that show up a large amount of the time but do not carry much meaning. For instance, “the” and “and” are stop words because they show up frequently in English but add little information. If we did not remove them, we might think two authors are extremely similar when they simply both happen to use the word “but” very frequently. We will also replace each capital letter with a lowercase one, so R knows that “Heteroskedasticity” and “heteroskedasticity” are the same word.

# remove numbers
docs <- tm_map(docs, removeNumbers)

# remove stop words
docs <- tm_map(docs, removeWords, stopwords("SMART"))

# make all letters lower case
docs <- tm_map(docs, content_transformer(tolower))
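
If you are curious what is actually being thrown away, a quick sanity check (purely illustrative) is to peek at the SMART stop word list and at the first cleaned document:

# a few of the SMART stop words that removeWords just stripped out
head(stopwords("SMART"), 10)

# the first few lines of the first cleaned document
head(as.character(docs[[1]]))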

Cleaning Documents with qdap

In order to do some extra cleaning we will use qdap, converting the documents to a data frame. We need to do this because most files contain a title page with a large amount of boilerplate text such as “Division of Economics”, “AJ Palumbo School of Business Administration”, “Duquesne University”, “Pittsburgh, Pennsylvania”, and “december”. We can remove these with the mgsub() function, which searches for matches to what we specify and replaces them with whatever we want (in this case just a blank space). We will also remove escaped characters with clean(), miscellaneous anomalies with Trim(), and replace abbreviations such as Dr. and Jr. with Doctor and Junior using replace_abbreviation(). Finally, we use mgsub() to remove abbreviated middle names such as “K.” and “A.”, because once we split the text into sentences qdap would be confused by these periods and start a sentence where it should not.

# convert corpus to data frame
qdocs <- as.data.frame(docs)

# put in author names
qdocs$docs <- c("Brady","Curcio","France","Gangewere","Horne","Kandrack","Valchev","Vicinie","Whitaker")

# remove misc anomalies and white spaces
qdocs$text <- clean(Trim(qdocs$text))

# replace symbols with words a.k.a. % with 'percent'
qdocs$text <- replace_symbol(qdocs$text)

# replace Ph.D with doctor
qdocs$text <- gsub("ph.d.","Doctor",qdocs$text)


# replace other abbreviations
qdocs$text <- replace_abbreviation(qdocs$text)

# remove all abbreviated middle names
qdocs$text <- mgsub(paste0(" ",letters,".")," ",qdocs$text)

# remove other custom stop words and garbage left from tables
qdocs$text <- mgsub(c("division ", " economics "," aj "," palumbo "," school "," business "," administration "," duquesne "
                      ," university "," pittsburgh, ", " pennsylvania "," december "," submitted ", " economics "," faculty "
                      ," partial ", " fulfillment ", " requirements ", " degree "," bachelor "," science "," business ",
                      " administration "," BSBA "," mcanulty ","department ", " college ","liberal arts ",
                      "..","*","**","***","()","( )","-","(",")",",,","(.)",",,","-.*",". .",", ,","|","+","_","-","=","["
                      ,"]",". ,",".:",":."," . ","<.","` ",", .",".,",",t ."," ., ",". .","^","'")," ",qdocs$text)

Analyzing Words and Sentences

Pretty Graphs of Words and Standard Deviations

Now that we have cleaned up our text files we can start making some pretty graphs! At this point we will split each document up by sentence using the sentSplit() function. Once the sentences are split up we will plot them using qdap’s built-in plotting method for data frames of sentences. You may get a warning stating “non character, missing end punctuation,” etc. This warning can be due to bits of the tables still remaining or to not completely scrubbing out all of the junk (or scrubbing too much!). We’ll ignore it for now, since we can look back at any file in qdocs and see that we have a pretty good general cleaning.

sent.docs <- sentSplit(qdocs,"text")
## Warning in sentSplit(qdocs, "text"): The following problems were detected:
## non character, missing ending punctuation, non space after comma, non ascii, containing escaped, indicating incomplete
## 
## *Consider running `check_text`
plot(sent.docs, grouping.var = "docs")

This plot shows that senior theses normally contain about 3800 words, with a max of about 5000 words. Curcio, Vicinie, and Whitaker all won the senior thesis with only about 3000 words! My good friend Bill Gangewere and Kandrack both sit right near the median number of words. The large spread in word count tells us that short and long papers both have a pretty good chance at winning. However, it should be noted that most theses do sit around the 3800 word mark, so, as theses go, short is sweet.

We can make this graph a bit more illustrative by examining the standardized word count of each sentence in each thesis. To do this, we will create new columns called scaled and ID. Reading from the inside out: we use wc() to get the number of words in each sentence, then tapply() with scale() to standardize each sentence’s word count within its document (so the values are in standard deviations from that document’s mean). unlist() puts these scaled counts back into a vector, which is attached to sent.docs. The IDs are created similarly, using seq_along() on the docs column to create a vector that numbers each sentence.

sent.docs$scaled <- unlist(tapply(wc(sent.docs$text), sent.docs$docs, scale))
sent.docs$ID <- factor(unlist(tapply(sent.docs$docs, sent.docs$docs, seq_along)))

Now we use a big ol’ ggplot() statement to make a pretty graph of the standardized word count of each sentence.

## Plot with ggplot2
library(ggplot2); library(grid)

 ggplot(sent.docs, aes(x = ID, y = scaled, fill = docs)) +
  geom_bar(stat ="identity", position ="identity",size = 5) +
  facet_grid(.~docs) + 
  ylab("Standard Deviations") + xlab("Sentences") +
  guides(fill = guide_legend(nrow = 1, byrow = TRUE,title="Author"))+ 
  theme(legend.position="bottom",axis.text = element_blank(), axis.ticks = element_blank()) +
  ggtitle("Standardized Word Counts\nPer Sentence")+coord_flip()

We can see that our good friend Bill Gangewere and Kandrack have the most consistent word counts per sentence. Horne, while having the longest thesis, has a pretty consistent word count per sentence aside from a stretch in the middle where things get a bit lengthy. Looking at this graph we can say that most theses have a pretty consistent word count per sentence, with one or two places that require a longer explanation.

Using tm to Inspect Words

At this point we will go back to our docs object so we can start using tm for analysis. There is functionality to convert qdap objects into tm objects, but I found it complicated to get qdap to do what I wanted. We will remove punctuation, our specific stop words, and extra white space. This time we will also stem the words. Stemming means breaking words down into their base form, so “stopping”, for example, turns into “stop”.
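
To see what stemming does on its own, here is a tiny check with SnowballC’s wordStem(); it is just an illustration, since stemDocument() below applies the same stemmer across the whole corpus:

library(SnowballC)

# every variant collapses to the same base word
wordStem(c("stopping", "stopped", "stops"))
## [1] "stop" "stop" "stop"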

library(SnowballC)

docs <- tm_map(docs, removePunctuation)

docs <- tm_map(docs, removeWords,c("division ", " economics "," aj "," palumbo "," school "," business "," administration "," duquesne "," university "," pittsburgh, ", " pennsylvania "," december "," submitted ", " economics "," faculty "," partial ", " fulfillment ", " requirements ", " degree "," bachelor "," science "," business "," administration "," BSBA "," mcanulty ","department ", " college ","liberal arts","()","()","( )","-","(",")",",,","(.)",",,","-.*",". .",".:",":.","[","]","<.","` ",", .",".,",",t ."," ., ",". .","^","'","this","the"))

docs <- tm_map(docs, removeWords,c("the","then","this","data", "econom",  "estim",
                                   "expect", "find",  "inform", "level", "measur",
                                   "model", "perform", "research" , "result", "signif",
                                   "studi", "tabl", "variabl","appendix"))

docs <- tm_map(docs, stripWhitespace)

docs <- tm_map(docs, stemDocument)

Now that we have done some cleaning we will create a document term matrix. A document term matrix (DTM) is a matrix with one column per unique term and one row per document. If a word appears in a document, the corresponding cell holds the number of times it appears in that document; otherwise the cell holds zero. You may have also heard of sparse matrices. A document term matrix is normally a sparse matrix, because most words appear in only a few of the documents. Each term in our DTM is also weighted, in this case simply by how frequently the word is used (term frequency).
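
As a toy example of what a DTM holds (two made-up one-line “documents”, not the thesis data), the printed matrix should look roughly like the commented output:

library(tm)

toy <- Corpus(VectorSource(c("growth drives growth",
                             "state growth varies")))
as.matrix(DocumentTermMatrix(toy))
##     Terms
## Docs drives growth state varies
##    1      1      2     0      0
##    2      0      1     1      1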

dtm <- DocumentTermMatrix(docs)
rownames(dtm) <- c("Brady","Curcio","France","Gangewere","Horne","Kandrack","Valchev","Vicinie","Whitaker")
dtm
## <<DocumentTermMatrix (documents: 9, terms: 4402)>>
## Non-/sparse entries: 9138/30480
## Sparsity           : 77%
## Maximal term length: 82
## Weighting          : term frequency (tf)

Now that we have a DTM we can use colSums() to find the most frequently used words. We will then order the sums and pull the fifteen largest off the tail end (because order() sorts from least to most).

freq <- colSums(as.matrix(dtm))


ord <- order(freq)


# most frequent terms
freq[tail(ord,n=15)]
Most Frequently Used Words

measur    187
risk      189
result    207
state     217
govern    220
studi     226
breach    234
signific  238
bank      260
expect    261
effect    268
growth    284
econom    311

Looking at this you would think most theses were about banks, breaches, and risk. However, while these terms show up quite a lot, they most likely come from only one or two documents; in other words, these words are sparse. Before we do any further analysis, notice above that our matrix is rather sparse, with about 77% of the entries being zeroes! We can use tm’s removeSparseTerms() function to drop terms that are missing from more than 30% of the documents, keeping only terms that appear in most of the theses.

# remove sparse terms
dtms <- removeSparseTerms(dtm, 0.3)
dtms
## <<DocumentTermMatrix (documents: 9, terms: 259)>>
## Non-/sparse entries: 2051/280
## Sparsity           : 12%
## Maximal term length: 16
## Weighting          : term frequency (tf)

Now, with some of the sparsity removed, we can examine things like correlations between words and words with high frequency. Using findFreqTerms() we will examine our trimmed matrix and pull out terms above a frequency cutoff. After we find that the word ‘state’ is used frequently, we can use findAssocs() to find words that have a correlation higher than .7 with ‘state’. After that, we can plot a pretty graph of frequent terms and restrict the graph to only show connections with a correlation above .6.

# find terms with highest frequency
# low freq chosen to produce most frequent
findFreqTerms(dtms, lowfreq=15)

# find words with high correlation to state
findAssocs(dtms,"state", corlimit=0.7)

# make a plot of freq terms with correlation above .6
plot(dtms,terms =findFreqTerms(dtms, lowfreq=100),corThreshold=0.6)

Most Frequently Used Words

econom  effect  estim  expect  growth  individu  inform

Words Associated with "state" (correlation with state)

size      0.97
appli     0.84
growth    0.84
compon    0.83
vari      0.83
unit      0.82
there     0.81
econom    0.80
interest  0.79
analysi   0.78
general   0.72

## Loading required package: Rgraphviz
## Loading required package: graph

The list of frequent terms was cut in order to save space. With the most frequently used words being the stems of words like growth, effect, individual, and time, we can say that most winning theses research growth and the effects of economic variables and data on individuals over time. States tend to be a grouping variable, and the size of states also seems to matter. Examining the plot, it appears that regress is closely linked with error, observation, significance, and sample, which hints that most winning theses are regression based. Test and error are connected, so a winning senior thesis may also perform strenuous checks on the error terms of its regression. Most analyses seem to be over states and time.

The last thing we will do is examine the number of letters in each word by senior thesis. To do this we will use the reshape2 package to turn our wide-format matrix into a long-format one. Essentially, when we turn a matrix into its long format, each row of the result holds one (author, term) pair along with the count of that term for that author. Then we use the column named Docs as our faceting variable and make the pretty graph below.

library(reshape2)

dtm.mat <- as.matrix(dtm)

dtm.melt <- melt(dtm.mat)

dtm.melt <- as.data.frame(dtm.melt)

knitr::kable(head(dtm.melt,n=10))
Docs Terms value
Brady abagail 0
Curcio abagail 0
France abagail 0
Gangewere abagail 1
Horne abagail 0
Kandrack abagail 0
Valchev abagail 0
Vicinie abagail 0
Whitaker abagail 0
Brady abbrevi 0
# make table of word counts
term.table <- table(nchar(as.character(dtm.melt$Terms)))

# get mode of word counts
term.mode <- as.numeric(names(term.table[which(term.table==max(term.table))]))

# make condition to highlight mode
cond <- nchar(as.character(dtm.melt$Terms))*dtm.melt$value == term.mode

#make a pretty graph
ggplot(dtm.melt) +
  geom_histogram(data = subset(dtm.melt, cond == FALSE), binwidth = 1,
                 aes(x = nchar(as.character(Terms)) * value, fill = Docs)) +
  geom_histogram(data = subset(dtm.melt, cond == TRUE), binwidth = 1,
                 aes(x = nchar(as.character(Terms)) * value), fill = "red") +
  facet_grid(Docs ~ .) +
  labs(x = "Number of Letters", y = "Number of Words") + xlim(1, 20) +
  guides(fill = guide_legend(nrow = 9, byrow = TRUE, title = "Author")) +
  theme(axis.text.y = element_blank(), axis.ticks.y = element_blank()) +
  ggtitle("Length of each word \n by Author") +
  geom_text(data = data.frame(x = 6.5, y = 30, label = "Mode"),
            aes(x, y, label = label), size = 3, inherit.aes = FALSE)

The mode, the most frequent number of letters per word, is six, so most words tend to be relatively short. This graph alludes to the idea that the best theses are ‘short and sweet’. The longer words are most likely statistical terms like “heteroskedasticity” and “endogeneity”. Taking everything into account, we can say that a winning senior thesis can be long or short, will be consistent in the number of words per sentence, will favor short words over longer ones, and will look at economic variables over U.S. states and time. So maybe next year’s winning thesis will be “Measuring the Expected Size of Economic Growth by State”.

I’m satisfied with R’s NLP packages and think R will continue to be one of the go-to programs for this type of analysis. One of R’s main strengths, which really shines in NLP, is how easy it is to make pretty graphics. When working with large documents, summarizing the data in a graph lets researchers explore it in a concise manner.