
Introduction to R: Graphs and Maps


Introduction

The purpose of this tutorial is to be an introductory text for the R programming language's graphing and mapping features. R is a free programming language and software environment for statistical computing and graphics. While it performs similar functions to packages such as SAS and Stata, each system has its own strengths and weaknesses. This paper uses ggplot2 and ggmap to show R's graphing capabilities. The document is structured as follows. Section two briefly explains the S language, how R processes input, installing packages, how to find help while exploring R, and importing SAS, Excel, and csv datasets.

Section three covers introductory statistics and data manipulation. Simple summary statistics are introduced first and then used to describe R's second-level functions, such as the apply family. A simple custom function will also be written. Section four uses the ggplot2 package to create simple graphs and maps. Throughout this document the data set 'iris' is used, as it comes available with R. This document is meant to be worked through interactively: running the R code as you read is highly encouraged.

R and SAS Language

This section will cover the basics of the R computing language, importing and exporting files, and installing packages. One main difference gives R and SAS their respective advantages. When each package was first built, R was made for small geological and natural datasets, which were easier to manipulate with an interpreter-style language. SAS was made for businesses with larger, dynamic datasets, so it was built in the style of a compiled language. The line between compiled and interpreted is fuzzier now, but for processes such as for loops, R is not as efficient as SAS. The SAS input below imports a data file, names the output shoes, specifies csv format, replaces any existing file, and indicates there are no column names.

proc import datafile="C:/temp/test.csv" out=shoes dbms=csv replace;
getnames=no;
run;

proc print data=shoes; run;

This chunk of code is compiled without any further user instruction. While something similar is possible in R with its second-level functions, the R code to do the same thing looks like this.

shoes <- read.csv(file="C:/temp/test.csv", header=FALSE)
shoes <- shoes[2:nrow(shoes),]

Each line of code is interpreted one at a time, which is slightly slower than SAS. The read.csv() function specifies the file and tells R there are no column headers. The second line tells R to take the object shoes and keep only rows 2 through N, N being the total number of rows. A major difference between SAS and R is that the output of any R function not assigned to an object is printed and then discarded. Without the assignment, code such as

read.csv(file="C:/temp/test.csv", header=FALSE)

R would simply display the csv as a data frame in the console without creating an object for it. At the bottom of this section is a table which covers some basic info for both languages.

Looking at the code above, it seems that, spacing aside, there is not much difference between the two languages. However, keep in mind that R interprets row by row while SAS processes chunk by chunk. Spend a moment thinking about how a for loop would be handled in each. It becomes very clear that a process that compiles chunks of code is going to be considerably faster at iteration than one that goes line by line. While there are ways around this in R, the SAS language has an advantage in iterative processes.
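As a minimal sketch of the difference, compare summing a vector with an explicit loop against a single vectorized call, which hands the whole operation to compiled code at once.

x <- runif(1e5)
loop.total <- 0
for (i in seq_along(x)) loop.total <- loop.total + x[i] #interpreted element by element
vec.total <- sum(x) #one call into compiled code
all.equal(loop.total, vec.total)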

Unlike SAS, R is not one giant package, but a connected network of packages written and approved to be uploaded onto the Comprehensive R Archive Network, or CRAN. Packages can be downloaded from R by typing:

install.packages("packagename")

Loading the package after download is just one more line of code.

library(package)

Note that installing requires a quoted string while loading takes the unquoted package name. RStudio is highly recommended for the purpose of easily keeping track of packages.
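For example, installing and then loading ggplot2, which is used later in this tutorial (installation needs an internet connection):

install.packages("ggplot2") #package name as a quoted string
library(ggplot2) #unquoted name to load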

Getting Help

Unlike SAS, R has no live support. Most users learn R from guides such as this one. There are several wonderful books and classes available online made to teach R, including Data Mining with R by Luis Torgo, Code School's online R course, and, for advanced R programming methods, the in-progress book Advanced R, available on its website. Documentation is kept for each package on CRAN, but help files can appear discouragingly difficult for newcomers to interpret. My best advice is to work through a package's documentation; many worked examples for packages such as ggplot2 or tseries are available online from websites such as R-bloggers. Answers to most questions can be found through either Google or posting to Stack Overflow. Inside of R, find help by typing:

?help

For particular functions, such as read.csv() from the earlier example, the same syntax reaches that function's help page.

?read.csv
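When the exact function name is unknown, base R can search for it; a quick sketch:

apropos("read") #objects on the search path whose names contain "read"
help.search("csv") #search installed documentation; ??csv is shorthand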

None of this is rocket science: with a little searching, and perhaps writing a new function or two, most tasks can be done in R. On a final note, it should be recognized that there are times when another statistical package is simply set up to perform a task more efficiently or easily than R. Given the choice between writing and testing new R code or using that other package, I prefer the latter.

Product   Developer       Written in                Scripts           Processing
SAS       SAS Institute   PL/I, Fortran, Assembly   None              Compiler
R         R Foundation    C, Fortran, S             R, Python, Perl   Interpreter

Importing and Exporting Data

This section examines how to import data for analysis. It is easy to import a basic csv into R, but what about Excel spreadsheets, SPSS, Stata, or SAS files? The R packages foreign and RODBC make importing such data very easy. As with the csv, only one line of code is needed inside R, but first the dataset needs to be placed into a transportable format within each statistical package. Here is an example of how this is done in SAS.

libname out xport 'c:/mydata.xpt';
data out.mydata;
set sasuser.mydata;
run;

The first line tells SAS to export the transport file mydata.xpt. The second line tells SAS to use the dataset mydata for the export. The set statement here exports all of the data, but it could also be used to concatenate data sets or subset mydata before export. Once the data is transferable, the code for R is simple.

library(foreign) #provides read.xport()
sasdata <- read.xport("C:/temp/mydata.xpt")

Each of the other main statistical packages has a similar export call within the program and a simple one-line read for R, usually something like read.stuff(); guessing from a help file (?read.table) can usually get pretty close to the answer. The Quick-R page on importing and exporting data collects these methods in one place.
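As a sketch, the foreign package also reads SPSS and Stata files; the file paths below are hypothetical placeholders.

library(foreign)
spssdata <- read.spss("C:/temp/mydata.sav", to.data.frame = TRUE) #SPSS save file
statadata <- read.dta("C:/temp/mydata.dta") #Stata dataset

Once the data is loaded in R, it is possible to view the first few lines with the function head().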

head(mydata)

This function displays the first six rows of the data set by default. The following code takes a data table and exports it to a csv.

write.csv(mydata, "C:/myRdata.csv")

There will be times when something other than a csv is necessary, but for this introductory lesson it serves the purpose of explaining how to take data out of R.
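When a csv will not do, write.table() handles arbitrary delimiters; a sketch with a tab-delimited file and a hypothetical path:

write.table(mydata, "C:/myRdata.txt", sep = "\t", row.names = FALSE) #tab-delimited export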

Manipulating Data

For this section use the dataset iris. The data set contains 3 classes of 50 observations each, where each class refers to a type of iris plant. Many examples use this or the mtcars dataset because they come pre-installed with R. To load the iris dataset use the data() function. The following code finds quick subsets of data which are useful in the rest of this analysis.

First load the data.

data(iris) 

Use the head function to see the first few observations.

head(iris)
##   Sepal.Length Sepal.Width Petal.Length Petal.Width Species
## 1          5.1         3.5          1.4         0.2  setosa
## 2          4.9         3.0          1.4         0.2  setosa
## 3          4.7         3.2          1.3         0.2  setosa
## 4          4.6         3.1          1.5         0.2  setosa
## 5          5.0         3.6          1.4         0.2  setosa
## 6          5.4         3.9          1.7         0.4  setosa

By using the $ operator it is possible to select just the Species column inside head().

head(iris$Species) 
## [1] setosa setosa setosa setosa setosa setosa
## Levels: setosa versicolor virginica

To create a new data frame of the top half of iris, we use the index of the iris data frame. Here 1:I(nrow(iris)/2) tells R to take rows 1 through the row that divides the data in half. The I() makes R carry out the division before the sequence is built. The blank after the comma tells R we want all columns.

iris.top<- iris[1:I(nrow(iris)/2),] 

Knowing the dimensions of your data frame makes manipulation much easier. dim() returns the dimensions of iris.top. [1] selects the first element of dim's output, the number of rows in iris.top; [2] selects the second element, the number of columns.

dim(iris.top)[1] 
## [1] 75
dim(iris.top)[2] 
## [1] 5
sub.iris<-subset(iris,select=-Species) #drop the Species column
iris.num<-iris[,1:4] #equivalently, keep the four numeric columns

The above code starts by using the data() function to load iris into the global environment. If you are using RStudio, iris should show up in the environment pane.

The head() function allows us to examine the first few rows. There are 5 columns of 150 rows, with Species being the class separator. It is also possible to use the head() function on a single column. Attaching $Species to iris allows us to access the column labeled Species. Here iris$Species is a factor with three different levels.
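The factor structure can be inspected directly; both results below come straight from the built-in iris data.

levels(iris$Species) #the three classes
## [1] "setosa"     "versicolor" "virginica"
table(iris$Species) #50 observations of each class
## 
##     setosa versicolor  virginica 
##         50         50         50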

Another way to subset the data is to call specific rows and columns from the object. This takes the form \(object[N_{t}:N_{t+k},M_{v}:M_{v+k}]\) on the end of the data object, where N indexes rows and M indexes columns. In iris.top, the call takes the first half of the rows, and since no columns are specified R keeps all of them. The function I() inside the call forces R to evaluate the division before the sequence is built, because the : operator binds more tightly than /. If you ever get something strange back from a simple multiplication or division inside an index, put an I() (or plain parentheses) around it. iris.top successfully pulled out the top half of the data set, as the dim() function confirms. dim() accepts a dataset and returns both the number of rows and columns, or either one can be selected as in the example, with [1] for rows and [2] for columns. Note that a subset fails when it requests more columns than the data frame contains: R stops with an error. To include more columns for later use, create them and either merge them onto the existing data frame after creation or create columns of 1s to multiply by later.
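A quick console check of this precedence point:

head(1:nrow(iris)/2) #(1:150)/2, since : binds tighter than /
## [1] 0.5 1.0 1.5 2.0 2.5 3.0
head(1:I(nrow(iris)/2)) #division happens first, giving 1:75
## [1] 1 2 3 4 5 6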

Statistics and Aggregates

This section's code shows how to create some summary statistics that describe the iris dataset. The R function summary() can be used to find the quartiles, minimum, maximum, mean, and median of the dataset. This is similar to SAS's PROC MEANS.

summary(iris)
##   Sepal.Length   Sepal.Width    Petal.Length   Petal.Width 
##  Min.   :4.30   Min.   :2.00   Min.   :1.00   Min.   :0.1  
##  1st Qu.:5.10   1st Qu.:2.80   1st Qu.:1.60   1st Qu.:0.3  
##  Median :5.80   Median :3.00   Median :4.35   Median :1.3  
##  Mean   :5.84   Mean   :3.06   Mean   :3.76   Mean   :1.2  
##  3rd Qu.:6.40   3rd Qu.:3.30   3rd Qu.:5.10   3rd Qu.:1.8  
##  Max.   :7.90   Max.   :4.40   Max.   :6.90   Max.   :2.5  
##        Species  
##  setosa    :50  
##  versicolor:50  
##  virginica :50  
##                 
##                 
## 

Using the summary() function returns the summary statistics of the iris dataset. This function is useful and informative, but what if the analysis requires the mean, standard deviation, or range of each column individually? In SAS, PROC MEANS is used to calculate each of these. In R, the apply() function does the same thing. Below is an example.

Take the means of each column.

cmode.iris<-apply(sub.iris,2,mean) 

Take the standard deviation of each column.

cstd.iris<-apply(sub.iris,2,sd) 

Find the range of each column; range() returns the minimum and maximum.

crange.iris<-apply(sub.iris,2,range)#Find range of columns
cmode.iris
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##        5.843        3.057        3.758        1.199
cstd.iris
## Sepal.Length  Sepal.Width Petal.Length  Petal.Width 
##       0.8281       0.4359       1.7653       0.7622
crange.iris
##      Sepal.Length Sepal.Width Petal.Length Petal.Width
## [1,]          4.3         2.0          1.0         0.1
## [2,]          7.9         4.4          6.9         2.5

Start with the dataset created above, sub.iris. To perform an operation on the columns of the data frame, specify 2 as the second argument; for a row-by-row operation, specify 1. Specifying c(1,2) applies the function over both rows and columns, element by element, which returns a matrix and is useful for arrays. Finally, pass in whatever function should be performed over the columns or rows. apply() is generally cleaner and often faster than an explicit for loop in R, so try to use it wherever a for loop would normally appear.

apply() is also very flexible. The code below gives an example of what a base apply() call can do with a bit of know-how. The objective of diff.iris is to subtract each column of the dataset from every other column without subtracting any column from itself. Notice that it is possible to write custom functions within apply(): specify a function and the parameters it needs. In this example the x parameter receives each pair of column indices produced by combn(ncol(sub.iris), 2), and the function body subtracts one column of sub.iris from the other for every pair. The next step does something similar, but uses paste() on the same combn() output to build the column names.

Take the mean of each row

rmean.iris<-apply(sub.iris,1,mean)

Use apply() on all pairs of sub.iris columns to take the difference of each combination.

diff.iris <- apply(combn(ncol(sub.iris), 2), 2,
    function(x) sub.iris[,x[1]] - sub.iris[,x[2]])

diff.iris currently has no column names. The next call uses the same combn()/apply() pattern to build them, pasting together the two column names in each pair.

colnames(diff.iris) <- apply(combn(ncol(sub.iris), 2), 2,
    function(x) paste(names(sub.iris)[x], collapse=' - '))
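To see exactly what apply() is iterating over here, inspect combn() itself: it returns one pair of column indices per column.

combn(ncol(sub.iris), 2)
##      [,1] [,2] [,3] [,4] [,5] [,6]
## [1,]    1    1    1    2    2    3
## [2,]    2    3    4    3    4    4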

R's built-in aggregate() function allows aggregation by any base R function or custom function.

agg.species<-aggregate(cbind(Sepal.Length, Sepal.Width,
                  Petal.Length, Petal.Width)~Species,data=iris,FUN=sum)

The function aggregate() is similar to the BY statement in SAS. Take the dataset iris and aggregate it by species. The function uses R's formula syntax: the variables on the left of the tilde are aggregated within the groups given on the right. First, cbind() is used to combine the columns to aggregate. FUN specifies how to aggregate the data; any base function such as mean() or sum() can be used here, as well as custom functions. This is very useful for aggregating state data to the national level, or for moving from weekly sales to monthly, yearly, etc.
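For example, swapping FUN to take group means of a single column:

aggregate(Sepal.Length ~ Species, data = iris, FUN = mean)
##      Species Sepal.Length
## 1     setosa        5.006
## 2 versicolor        5.936
## 3  virginica        6.588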

In the apply() functions used previously, custom functions were written inline to perform the operation. One of the largest perks of R is the ease of creating these custom functions. Because R is open source, typing the name of most functions without parentheses prints the underlying code. Try this with apply() or a similar function to get a feel for how R syntax should look. Below is a simple example of a custom function to demean the data, along with its apply() equivalent.

demean.dat<-function(x){
#take mean of columns
mean.dat<-apply(x,2,mean) 

#sweep subtracts mean from x columns
de.dat<-sweep(x,2,mean.dat,FUN="-") 
return(de.dat)
}

demean.iris<- demean.dat(sub.iris) #Call custom function

Here is the apply equivalent.

meapply.iris<-apply(sub.iris,2,function(x) x - mean(x))

#Proves they are both the same
sum(demean.iris-meapply.iris)
## [1] 0

The function starts by creating an object demean.dat. The x inside of the function call is a placeholder: anything that needs to be supplied by the user is placed inside the parentheses. The body of the function goes between the braces and is written just as it would be in the R console. Whatever the function should output is placed inside return().

return(list(data=data,Rsquared=Rvalue, oranges=apples))

In the first example one object was returned, so the result was stored directly in demean.iris. If multiple objects are to be returned, wrap them in a list as in the second example. Elements of the list are then accessed by typing something like demean.dat$data. This looks odd at first, but keep in mind that most structures in R behave like lists, and this is how sub-lists are accessed.
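Here is a minimal sketch of a function returning several objects in a list; the function and element names are just illustrative.

col.stats <- function(x){
  #compute column means and standard deviations
  list(means = apply(x, 2, mean),
       sds   = apply(x, 2, sd))
}
iris.stats <- col.stats(sub.iris)
iris.stats$means #access one element of the returned list with $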

Plots and Maps

Plotting with ggplot2

This section will cover basic graphing and mapping using the R packages ggplot2 and ggmap. I will not be able to cover everything, but with this example you will be able to utilize most of the functions covered in the ggplot2 documentation. All ggplot2 functions work with ggmap. ggplot2 is one of R's most notable packages because of its simplicity and sophistication. One of the more interesting benefits of the gg family of packages is that maps and graphs are stored as objects, which allows additions to be layered onto a graph much like adding terms to a model. An example will also be given to show that R does not automatically update objects when the underlying objects have been changed.

First load ggplot2 into the global environment. Using the qplot() function, we create a scatterplot of sepal length against petal length.

library(ggplot2)
s.colour<-qplot(Sepal.Length, Petal.Length, data = iris,
                colour = Species,size = Petal.Width,alpha=I(.7))
s.colour

[Plot: Sepal.Length against Petal.Length, coloured by Species and sized by Petal.Width]

Take the s.colour object and add labels for the x and y axes, just as if we were adding variables to a regression.

I.colour<-s.colour+xlab("Sepal Length") + ylab("Petal Length")
I.colour

[Plot: the same scatterplot with x and y axis labels added]

To create a more informative graph, map the size variable so that each observation is scaled by petal length. Also note that I.colour will not be updated.

s.colour<-qplot(Sepal.Length, Petal.Length, data = iris, colour = Species
                ,size = Petal.Length,alpha=I(.7))

I.colour

[Plot: I.colour, unchanged despite the redefinition of s.colour]

If colour is not an option, use facets to break out each species into its own panel.

s.colour<-qplot(Sepal.Length, Petal.Length, data = iris,
                size = Petal.Length, alpha=I(.7), facets=~Species)
s.colour

[Plot: Sepal.Length against Petal.Length, faceted by Species]

Use the function qplot() to make a scatterplot of Sepal.Length and Petal.Length from iris. To mark which species each point comes from, map colour to Species. The same can be done with size; here size is mapped to Petal.Width. Alpha makes each point semi-transparent so that dense clusters of overlapping points stay readable. Displaying the result gives the basic qplot graph. To label the graph, attach xlab and ylab to the object and store the new commands as I.colour, as seen above. Note that R objects do not automatically update when their sub-objects or inputs change: I.colour still shows the original plot even after s.colour was redefined.

Mapping with ggmap

ggmap, a package by the creator of ggplot2, gives R smooth and easy mapping capabilities. The following code will take a csv file of addresses, turn them into coordinates, plot their real points, and create a heat map of their density. This is all done within the context of ggmap.

Load the ggmap library and the address csv. I do not provide the dataset of addresses, but the finished coordinates are provided (the coords22.csv file read in below).

library(ggmap)
add<-"C:/Users/brond_000/Documents/R/WorkingD/Addr.csv"
dff<-read.csv(add,header=TRUE,stringsAsFactors=FALSE)

The columns may come in as factors, so converting the address variable to character will ensure the geocode() function works. geocode() accesses the Google Maps API, allowing automatic conversion of street addresses to coordinates. Any poorly coded addresses will come back as NAs.

dff$Address<-as.character(dff$Address)

coords<- geocode(dff$Address)

For further mapping, it is important to remove NAs from the dataset.

coords2<-na.omit(coords)

The header argument has already been mentioned; it is TRUE here because the file has column names. R automatically converts strings into factors. This was not a problem when the strings really were factors, as in iris, but for addresses it is important to set stringsAsFactors to FALSE. Just to make sure, the Address column is also set explicitly as a character vector. The addresses are translated to coordinates through a Google Maps API call. Once all the coordinates are collected, use the head() function to take a peek at the data. Notice there are some NAs from bad addresses. The mapping functions can have difficulty with these, so remove them with na.omit(). The following code will make a map of the US, place points at the coordinates in the dataset, and finally make a heat map that shows the clusters of coordinates.

Load up the dataset like normal, specifying stringsAsFactors=FALSE to tell R “If there are strings in the Data do not convert them to factors.”

library(ggmap)

crd<-"C:/Users/brond_000/Documents/R/WorkingD/coords22.csv"

coords2<-read.csv(crd, header=TRUE, stringsAsFactors=FALSE)

We can use the get_map() function to retrieve a map centered near the mean of the latitude and longitude of the dataset (the code nudges the center slightly west and south).

mapgilbert <- get_map(location = c(lon = mean(coords2$lon) - 2,
            lat = mean(coords2$lat) - 1), zoom = 4, scale = 1)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=36.021733,-97.064618&zoom=4&size=%20640x640&scale=%201&maptype=terrain&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms

Similar to how I.colour was built, we'll add geom_point() in order to place a point on the map for each observation.

DotMap<- ggmap(mapgilbert) + geom_point(data = coords2,aes(x = lon, y = lat,
    fill = "red", alpha = 0.7),
    size = 1, shape = 22) +guides(fill=FALSE, alpha=FALSE, size=FALSE)
DotMap
## Warning: Removed 12 rows containing missing values (geom_point).

[Plot: map of the US with a point at each coordinate]

We will create the final map by adding geom_density2d() to DotMap. This will use the number of observations in an area as the density to create a heat map.

FinalMap<-DotMap+geom_density2d(data = coords2, aes(x = lon, y = lat,fill="white"))
FinalMap
## Warning: Removed 12 rows containing non-finite values (stat_density2d).
## Warning: Removed 12 rows containing missing values (geom_point).

[Plot: the dot map overlaid with a density heat map]

After reading in the coordinates, the code makes an object arbitrarily titled mapgilbert. The get_map() function returns a map layout that ggmap can read, centered at the means of the latitude and longitude; zoom and scale control the size. The function c() is used to concatenate single numbers together. mapgilbert is plotted with ggmap(), and geom_point() places a dot on the map for each observation, with x and y mapped to longitude and latitude. It is important to note that parameters such as fill, alpha, and size have the same definition as in ggplot2. guides() is an additional function controlling whether a legend is shown. Finally, geom_density2d() creates the heat map.

Conclusion

This document gives examples related to data manipulation, importing, exporting, function creation, and basic graphics. In terms of R's usability, this is just the beginning. Packages such as knitr can be used to write documents, such as this one, entirely from within R. bayesm and MCMCpack are packages that perform Bayesian hierarchical models, Gibbs sampling, Bayesian logit regressions, and much more. The dplyr package provides fast, pipeline-style data manipulation. With time as the only cost of entry, learning the R language is a wise investment.