Updates and Research from Steve Bronder

This site gives current information on my (Steve Bronder's) research and personal life.

# Visualizing Payments to Doctors With Package Statebins

## Introduction

The purpose of this post is to explain how to graph geographic, state-level data with the statebins package. To do this we will play with General Payment Data for non-research/ownership payments to physicians and teaching hospitals. This data was recently released and, in short, contains the "gifts" pharma companies and others give to doctors and teaching hospitals because they are just great people. The data used throughout this tutorial can be found on the Open Payments data section of the Centers for Medicare & Medicaid Services here.

Statebins is an R package that produces choropleth maps for US states. These maps preserve the relative geographic placement of states but have the look and feel of a traditional heatmap. The package is based on work by the Washington Post graphics department in their report on The States Most Threatened by Trade. Its functions allow binned, discrete, and continuous scale heat maps. We will demonstrate two of these by binning the number of doctors receiving payments in each state as well as mapping the mean and total payments in each state.
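Before we get to the payment data, here is a minimal sketch of what those two scale types look like. The data frame and its values are invented purely for illustration; the `statebins()` and `statebins_continuous()` function and argument names are taken from the CRAN version of the package current at the time of writing, so check `?statebins` if they have changed.

```r
library(statebins)

# invented toy data: one row per state abbreviation with a random value
state.dat <- data.frame(state = state.abb,
                        value = runif(50, 0, 100))

# binned fill: cut the value into four bins
statebins(state.dat, state_col = "state", value_col = "value",
          breaks = 4, labels = c("Low", "Mid", "High", "Very High"),
          legend_title = "Payments")

# continuous fill over the same data
statebins_continuous(state.dat, state_col = "state", value_col = "value",
                     legend_title = "Payments")
```

Both calls return a ggplot object, so the same `+` additions we use below for qplot graphs work here too.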

Let's get started by loading in the data. The data is one GB in size, so we will use the fread() function from the data.table package. stringsAsFactors is FALSE by default; however, call me old fashioned, because we are going to set it anyway.

library(statebins)
library(data.table)
pharm.data <- fread("./General_Payment_Data_2013.csv", header = TRUE,
                    stringsAsFactors = FALSE, showProgress = FALSE)

## Plots and Maps

### Plotting with ggplot2

This section will cover basic graphing and mapping using the R packages ggplot2 and ggmap. I will not be able to cover everything, but with this example you will be able to use most of the functions described in the ggplot2 documentation. All ggplot2 functions work with ggmap. ggplot2 is one of R's most notable packages because of its simplicity and sophistication. One of the more interesting features of the gg family of packages is that maps and graphs are stored as objects. This allows additions to be layered onto a graph much like adding variables to a model. An example will also show that R does not automatically update an object when the objects underlying it have changed.

First, load ggplot2 into the global environment. Using the qplot() function we create a scatterplot of sepal length against petal length, with point size mapped to petal width.

library(ggplot2)
s.colour <- qplot(Sepal.Length, Petal.Length, data = iris,
                  colour = Species, size = Petal.Width, alpha = I(.7))
s.colour

Take the s.colour object and add labels for the x and y axes as if we were adding variables to a regression.

I.colour <- s.colour + xlab("Sepal Length") + ylab("Petal Length")
I.colour

To create a more informative graph include the size variable to make each observation a different size depending on petal length. Also note that I.colour will not be updated.

s.colour <- qplot(Sepal.Length, Petal.Length, data = iris,
                  colour = Species, size = Petal.Length, alpha = I(.7))

I.colour

If colour is not an option, use facets to break the plot out by species.

s.colour <- qplot(Sepal.Length, Petal.Length, data = iris,
                  size = Petal.Length, alpha = I(.7), facets = ~Species)
s.colour

Use the function qplot() to make a scatterplot of Sepal.Length and Petal.Length from iris. To show which species each point comes from, map colour to Species. The same could be done with size; here size is mapped to Petal.Width instead. Alpha makes the points partially transparent, so overlapping points are easier to distinguish. Displaying the result gives the basic qplot graph. To label the graph, attach xlab() and ylab() to the object and store the new commands as I.colour, as shown above. Note that R objects do not automatically update whenever subobjects or functions have been manipulated.

### Mapping with ggmap

ggmap, a package co-authored by the creator of ggplot2, gives R smooth and easy mapping capabilities. The following code will take a csv file of addresses, turn them into coordinates, plot their real points, and create a heat map of their density. This is all done within the context of ggmap.

Load the ggmap library and the address csv. I do not provide a dataset of addresses; however, I do provide the finished coordinates here.

library(ggmap)
# add is the file path to your csv of addresses
dff <- read.csv(add, header = TRUE, stringsAsFactors = FALSE)

The columns may come in as factors, so converting the address variable will ensure the geocode() function works. The geocode() function accesses the Google Maps API, allowing automatic conversion of street addresses to coordinates. Any poorly formatted addresses will come back as NAs.

dff$Address <- as.character(dff$Address)

coords <- geocode(dff$Address)

For further mapping, it is important to remove NAs from the dataset.

coords2 <- na.omit(coords)

header has already been mentioned and is specified as TRUE because this matrix has column names. R automatically converts strings into factors. This was not a problem when the strings were factors, as in iris, but for addresses it is important to set stringsAsFactors to FALSE. Just to make sure the addresses are a character vector, set the Address column as a character vector. The addresses are translated to coordinates through a Google Maps API call. Once all the coordinates are collected, use the head() function to take a peek at the data. Notice there are some NAs from bad addresses. The mapping functions can have some difficulty with these, so remove them with the na.omit() function.

The following code will make a map of the US, place points at the coordinates in the dataset, and finally make a heat map that shows the clusters of coordinates. Load up the dataset like normal, specifying stringsAsFactors = FALSE to tell R "if there are strings in the data, do not convert them to factors."

library(ggmap)
crd <- "C:/Users/brond_000/Documents/R/WorkingD/coords22.csv"
coords2 <- read.csv(crd, header = TRUE, stringsAsFactors = FALSE)

We can use the get_map() function to retrieve a map centered near the mean of the latitude and longitude of the dataset.

mapgilbert <- get_map(location = c(lon = mean(coords2$lon) - 2,
                                   lat = mean(coords2$lat) - 1),
                      zoom = 4, scale = 1)
## Map from URL : http://maps.googleapis.com/maps/api/staticmap?center=36.021733,-97.064618&zoom=4&size=%20640x640&scale=%201&maptype=terrain&sensor=false
## Google Maps API Terms of Service : http://developers.google.com/maps/terms

Similar to I.colour, we'll add geom_point() to place a point on the map for each observation.

DotMap <- ggmap(mapgilbert) +
  geom_point(data = coords2, aes(x = lon, y = lat, fill = "red", alpha = 0.7),
             size = 1, shape = 22) +
  guides(fill = FALSE, alpha = FALSE, size = FALSE)
DotMap
DotMap
## Warning: Removed 12 rows containing missing values (geom_point).

We will create the final map by adding geom_density2d() to DotMap. This will use the number of observations in an area as the density to create a heat map.

FinalMap <- DotMap +
  geom_density2d(data = coords2, aes(x = lon, y = lat, fill = "white"))
FinalMap
## Warning: Removed 12 rows containing non-finite values (stat_density2d).
## Warning: Removed 12 rows containing missing values (geom_point).

After reading in the coordinates, the code makes an object arbitrarily titled mapgilbert. Using the get_map() function, a map layout that ggmap can read is returned, centered at the means of the latitude and longitude. zoom and scale control the size. The function c() combines its arguments into a vector. mapgilbert is plotted with ggmap() and geom_point() is used to place dots on the map, with x and y being the longitude and latitude. It is important to note that parameters such as fill, alpha, and size have the same definitions as in ggplot2. guides() is an additional function that controls whether a legend appears. Finally, use geom_density2d() to create the heat map.

## Conclusion

This post gives examples related to data manipulation, importing, exporting, function creation, and basic graphics. In terms of R usability, this is just the beginning. Packages such as knitr can be used to write documents, such as this one, completely and easily from within R. bayesm and MCMCpack are packages that perform Bayesian hierarchical models, Gibbs sampling, Bayesian logit regressions, and much more. A package named dplyr provides a fast, expressive grammar for data manipulation. With the only cost of access being time, learning the R language is a wise investment.
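As a quick taste of the dplyr grammar mentioned above, here is a minimal sketch using the built-in iris data rather than the payment data; the column summaries are invented for illustration, but the same group_by()/summarise() pattern would aggregate payments by state for a statebins map.

```r
library(dplyr)

# mean sepal length and row count per species
iris %>%
  group_by(Species) %>%
  summarise(mean.sepal = mean(Sepal.Length),
            n.obs = n())
```

Each verb returns a data frame, so steps chain together with the pipe operator much like layers are added to a ggplot object.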