R Project Sample Code That Reads and Plots
Reporting data results #1
The video lectures for this chapter are embedded at relevant places in the text, with links to download a pdf of the associated slides for each video. You can also access a full playlist for the videos for this chapter.
Guidelines for good plots
Download a pdf of the lecture slides for this video.
There are a number of very thoughtful books and articles about creating graphics that effectively communicate information. Some of the authors I highly recommend (and from whose work I've pulled the guidelines for good graphics we'll talk about this week) are:
- Edward Tufte
- Howard Wainer
- Stephen Few
- Nathan Yau
You should plan, in particular, to read The Visual Display of Quantitative Information by Edward Tufte before you graduate.
This week, we'll focus on six guidelines for good graphics, based on the writings of these and other specialists in data display. The guidelines are:
- Aim for high data density.
- Use clear, meaningful labels.
- Provide useful references.
- Highlight interesting aspects of the data.
- Make order meaningful.
- When possible, use small multiples.
For the examples, I'll use dplyr for data cleaning and, for plotting, the packages ggplot2, gridExtra, and ggthemes.
library(tidyverse) ## Loads `dplyr` and `ggplot2`
library(gridExtra)
library(ggthemes)
You can load the data for today's examples with the following code:
library(faraway)
data(nepali)
data(worldcup)
library(dlnm)
data(chicagoNMMAPS)
chic <- chicagoNMMAPS
chic_july <- chic %>%
  filter(month == 7 & year == 1995)
High data density
Guideline 1: Aim for high data density.
Download a pdf of the lecture slides for this video.
You should try to increase, as much as possible, the data-to-ink ratio in your graphs. This is the ratio of "ink" providing information to all ink used in the figure. One way to think about this is that the only graphs you make that use up a lot of your printer's ink should be packed with information.
The two graphs in Figure 4.1 show the same information, but use very different amounts of ink. Each shows the number of players in each of four positions in the worldcup dataset. Notice how, in the plot on the right, a single dot for each category shows the same information that a whole filled bar is showing on the left. Further, the plot on the right has removed the gridded background, removing even more "ink."
Figure 4.2 gives another example of two plots that show the same information but with very different data densities. This figure uses the chicagoNMMAPS data from the dlnm package, which includes daily mortality, weather, and air pollution data for Chicago, IL. Both plots show daily mortality counts during July 1995, when a very severe heat wave hit Chicago. Notice how many of the elements in the plot on the left, including the shading under the mortality time series and the colored background and grid lines, are unnecessary for interpreting the message from the data.
By increasing the data-to-ink ratio in a plot, you can help viewers see the message of the data more quickly. A cluttered plot is harder to interpret. Further, you leave room to add some of the other elements I'll talk about, including highlighting interesting data and adding useful references. Notice how the plots on the left in Figures 4.1 and 4.2 are already cluttered and leave little room for adding extra elements, while the plots on the right of those figures have much more room for additions.
One quick way to increase data density in ggplot2 is to change the theme for the plot. The theme specifies a number of the "background" elements to a plot, including elements like the plot grid, background color, and the font used for labeling. Some themes come with ggplot2, including:
- theme_bw
- theme_minimal
- theme_void
You can find more themes in packages that extend ggplot2. The ggthemes package, in particular, has some excellent additional themes.
Figure 4.3 shows some examples of the effects of using different themes. All show the same information: a plot of daily deaths in Chicago in July 1995. The top left graph shows the graph with the default theme. The other plots show the effects of adding different themes, including the black-and-white theme that comes with ggplot2 (top right) and various themes from the ggthemes package. You can even use themes to add some questionable choices for different elements, like the Excel theme (bottom left).
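As a minimal sketch of how a theme gets added to a plot, the code below builds a small line plot and then adds theme_minimal(). The deaths data frame is made-up stand-in data (it is not the Chicago data), included only so the example runs on its own.

```r
library(ggplot2)

# Hypothetical stand-in data (not the chicagoNMMAPS data)
deaths <- data.frame(day = 1:10,
                     death = c(110, 115, 108, 120, 150, 210, 180, 130, 118, 112))

# Default theme: gray background with white grid lines
default_plot <- ggplot(deaths, aes(x = day, y = death)) +
  geom_line()

# Same plot with a theme that uses less non-data ink
minimal_plot <- default_plot + theme_minimal()

print(minimal_plot)
```

Because a theme is just another element added with +, you can save a plot as an object and try several themes on it without rewriting the rest of the code.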
Meaningful labels
Guideline 2: Use clear, meaningful labels.
Download a pdf of the lecture slides for this video.
Graphs often default to use abbreviations for axis labels and other labeling. For example, the default is for ggplot2 plots to use column names for the x- and y-axes of a scatterplot. While this is convenient for exploratory plots, it's often not adequate for plots for presentations and papers. You'll want to use short and easy-to-type column names in your dataframe to make coding easier, but you should use longer and more meaningful labeling in plots and tables that others need to interpret.
Furthermore, text labels can sometimes be aligned in a way that makes them difficult to read. For example, when plotting a categorical variable along the x-axis, it can be hard to fit labels for each category that are long enough to be meaningful.
Figure 4.4 gives an example of the same information shown with labels that are harder to interpret (left) versus with clear, meaningful labels (right). Notice how the graph on the left is using abbreviations for the categorical variable ("DF" for "Defense"), abbreviations for axis labels ("Pos" for "Position" and "Pls" for "Number of players"), and has the player position labels in a vertical alignment. On the right graph, I have made the graph easier to quickly read and interpret by spelling out all labels and switching the x- and y-axes, so that there's room to fully spell out each position while still keeping the alignment horizontal, so the reader doesn't have to turn the page (or their head) to read the values.
There are a few strategies you can use to make labels clearer when plotting with ggplot2:
- Add xlab and ylab elements to the plot, rather than relying on the column names in the original data. You can also relabel x- and y-axes with scale elements (e.g., scale_x_continuous), and the scale functions give you more power to also make other changes to the x- and y-axes (e.g., changing break points for the axis ticks). However, if you just need to change axis labels, xlab and ylab are often quicker.
- Include units of measurement in axis titles when relevant. If units are dollars or percent, check out the scales package, which allows you to add labels directly to axis elements by including arguments like labels = percent in scale elements. See the helpfile for scale_x_continuous for some examples.
- If the x-variable requires longer labels, as is frequently the case with categorical data (for example, player positions in Figure 4.4), consider flipping the coordinates, rather than abbreviating or rotating the labels. You can use coord_flip to do this.
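The strategies above can be sketched in a few lines. This example assumes a small, made-up data frame of player counts with deliberately terse column names (Pos and Pls are hypothetical, echoing the abbreviations in Figure 4.4); it is an illustration, not the code behind that figure.

```r
library(ggplot2)

# Illustrative counts of players by position (stand-in data with terse column names)
positions <- data.frame(Pos = c("Defender", "Forward", "Goalkeeper", "Midfielder"),
                        Pls = c(188, 143, 36, 228))

label_plot <- ggplot(positions, aes(x = Pos, y = Pls)) +
  geom_point() +
  xlab("Position") +            # spell out the axis titles
  ylab("Number of players") +
  coord_flip()                  # flip so position names read horizontally

print(label_plot)
```

Short column names stay in the data frame, where they are easy to type, while readers of the plot only see the full labels.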
References
Guideline 3: Provide useful references.
Data is easier to interpret when you add references. For example, if you show what is typical, it helps viewers interpret how unusual outliers are.
Figure 4.5 shows daily mortality during July 1995 in Chicago, IL. The graph on the right has added shading showing the range of daily death counts in July in Chicago for neighboring years (1990–1994 and 1996–2000). This added reference helps clarify for viewers how unusual the number of deaths during the July 1995 heat wave was.
Another useful way to add references is to add a linear or smooth fit to the data, to help clarify trends in the data. Figure 4.6 shows the relationship between passes and shots for Forwards in the worldcup dataset. The plot on the right has added a smooth function of the relationship between these two variables.
For scatterplots created with ggplot2, you can use the function geom_smooth to add a smooth or linear reference line. Here is the code that produces Figure 4.6:
ggplot(filter(worldcup, Position == "Forward"), aes(x = Passes, y = Shots)) +
  geom_point(size = 1.5) +
  theme_few() +
  geom_smooth()
The most useful geom_smooth parameters to know are:
- method: The default is to add a loess curve if the data includes fewer than 1000 points and a generalized additive model for 1000 points or more. However, you can change to show the fitted line from a linear model using method = "lm" or from a generalized linear model using method = "glm".
- span: How wiggly or smooth the smooth line should be (smaller value: more wiggly; larger value: more smooth)
- se: TRUE or FALSE, indicating whether to include shading for 95% confidence intervals.
- level: Confidence level for confidence interval (e.g., 0.90 for 90% confidence intervals)
Lines and polygons can also be useful for adding references, as in Figure 4.5. Useful geoms for such shapes include:
- geom_hline, geom_vline: Add a horizontal or vertical line
- geom_abline: Add a line with an intercept and slope
- geom_polygon: Add a filled polygon
- geom_path: Add an unfilled polygon
You want these references to support the main data shown in the plot, but not overwhelm it. When adding these references:
- Add reference elements first, so they will be plotted under the data, instead of on top of it.
- Use alpha to add transparency to these elements.
- Use colors that are unobtrusive (e.g., grays).
- For lines, consider using non-solid line types (e.g., linetype = 3).
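Here is a minimal sketch of those tips applied together. It assumes made-up daily counts and a hypothetical "typical" value of 115; neither comes from the Chicago data.

```r
library(ggplot2)

# Hypothetical stand-in data and reference value (not from chicagoNMMAPS)
deaths <- data.frame(day = 1:10,
                     death = c(110, 115, 108, 120, 150, 210, 180, 130, 118, 112))
typical <- 115

ref_plot <- ggplot(deaths, aes(x = day, y = death)) +
  # Reference line is added first, so it is drawn underneath the data
  geom_hline(yintercept = typical, color = "gray", alpha = 0.8, linetype = 3) +
  # The data layer is added second, so it is drawn on top
  geom_line()

print(ref_plot)
```

Layers are drawn in the order they are added, which is why the reference geom comes before geom_line() in this sketch.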
Highlighting
Guideline 4: Highlight interesting aspects.
Download a pdf of the lecture slides for this video.
Consider adding elements to highlight noteworthy elements of the data. For example, in the graph on the right of Figure 4.7, the days of the heat wave (based on temperature measurements) have been highlighted over the mortality time series by using a thick red line.
In the graphs below, the names of the players with the most shots and passes have been added to highlight these unusual points.
One helpful way to annotate is with text, using geom_text(). For this, you'll first need to create a dataframe with the hottest day in the data:
hottest_day <- chic_july %>%
  filter(temp == max(temp))
hottest_day[ , 1:6]
##         date time year month doy      dow
## 1 1995-07-13 3116 1995     7 194 Thursday
chic_plot + geom_text(data = hottest_day, label = "Max", size = 3)
With geom_text, you'll often want to adjust the position of the text (for example, with the hjust and vjust parameters) so it won't be right on top of the data points:
chic_plot + geom_text(data = hottest_day, label = "Max", size = 3, hjust = 0, vjust = -1)
You can also use lines to highlight. For this, it is often useful to create a new dataframe with data for the reference. To add a line for the Chicago heat wave, I've added a dataframe called hw with the relevant date range. I'm setting the y-value to be high enough (425) to ensure the line will be placed above the mortality data.
hw <- data.frame(date = c(as.Date("1995-07-12"), as.Date("1995-07-16")),
                 death = c(425, 425))
b <- chic_plot +
  geom_line(data = hw, aes(x = date, y = death), size = 2)
Order
Guideline 5: Make order meaningful.
Download a pdf of the lecture slides for this video.
You can make the ranking of data clearer from a graph by using order to show rank. Often, factor or categorical variables are ordered by something that is not interesting, like alphabetical order.
You can re-order factor variables in a graph by resetting the factor using the factor function and changing the order that levels are included in the levels parameter.
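As a small sketch of that idea, the code below re-orders a factor by a count instead of alphabetically. The positions data frame and its n_players column are illustrative stand-ins, not taken from the course data.

```r
# Illustrative player counts by position (stand-in data)
positions <- data.frame(Position = c("Defender", "Forward", "Goalkeeper", "Midfielder"),
                        n_players = c(188, 143, 36, 228))

# By default, factor levels are in alphabetical order
default_levels <- levels(factor(positions$Position))

# Reset the factor so the levels run from fewest to most players
ordered_levels <- positions$Position[order(positions$n_players)]
positions$Position <- factor(positions$Position, levels = ordered_levels)

levels(positions$Position)
```

Any plot that uses Position on an axis or as a facet will now follow this order of levels rather than the alphabet.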
Small multiples
Guideline 6: When possible, use small multiples.
Download a pdf of the lecture slides for this video.
Small multiples are graphs that use many small plots showing the same thing for different facets of the data. For example, instead of using color in a single plot to show data for males and females, you could use two small plots, one each for males and females.
Typically, in small multiples, all plots use the same x- and y-axes. This makes it easier to compare across plots, and it also allows you to save room by limiting axis annotation.
You can use the facet functions to create small multiples. This separates the graph into several small graphs, one for each level of a factor.
The facet functions are:
- facet_grid()
- facet_wrap()
For example, to create small multiples by sex for the Nepali dataset, when plotting height versus weight, you can call:
ggplot(nepali, aes(ht, wt)) + geom_point() + facet_grid(. ~ sex)
The facet_grid function can facet by one or two variables. One will be shown by rows, and one by columns:
## Generic code
facet_grid([factor for rows] ~ [factor for columns])
The facet_wrap() function can only facet by one variable, but it can "wrap" the small graphs for that variable, so they don't all have to be in one row or column:
## Generic code
facet_wrap(~ [factor for faceting], ncol = [number of columns])
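To make the generic facet_wrap() pattern concrete, here is a short sketch using the mpg dataset that comes with ggplot2 (chosen here only because it needs no extra packages; it is not one of this chapter's datasets).

```r
library(ggplot2)

# One small scatterplot per vehicle class, wrapped into four columns
facet_plot <- ggplot(mpg, aes(x = displ, y = hwy)) +
  geom_point() +
  facet_wrap(~ class, ncol = 4)

print(facet_plot)
```

With seven vehicle classes and ncol = 4, the small plots wrap onto two rows instead of stretching across a single row.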
Often, when you do faceting, you'll want to re-name your factor levels or re-order them. For this, you'll need to use the factor() function on the original vector. For example, to rename the sex factor levels from "1" and "2" to "Male" and "Female," you can run:
nepali <- nepali %>%
  mutate(sex = factor(sex, levels = c(1, 2),
                      labels = c("Male", "Female")))
Notice that the labels for the 2 graphs have now changed:
ggplot(nepali, aes(ht, wt)) + geom_point() + facet_grid(. ~ sex)
To re-order the factor, and show the plot for "Female" first, you can use factor to change the order of the levels:
nepali <- nepali %>% mutate(sex = factor(sex, levels = c("Female", "Male")))
Now notice that the order of the plots has changed:
ggplot(nepali, aes(ht, wt)) + geom_point() + facet_grid(. ~ sex)
Advanced customization
Scales
There are a number of different functions for adjusting scales. These follow the following convention:
## Generic code
scale_[aesthetic]_[vector type]
For example, to adjust the x-axis scale for a continuous variable, you'd use scale_x_continuous. You can use a scale function for an axis to change things like the axis label (which you could also change with xlab or ylab) as well as position and labeling of breaks.
For example, here is the default for plotting time versus passes for the worldcup dataset, with the number of shots taken shown by size and position shown by color:
ggplot(worldcup, aes(x = Time, y = Passes, color = Position, size = Shots)) +
  geom_point(alpha = 0.5)
ggplot(worldcup, aes(x = Time, y = Passes, color = Position, size = Shots)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(name = "Time played (minutes)",
                     breaks = 90 * c(2, 4, 6),
                     minor_breaks = 90 * c(1, 3, 5))
Parameters you might find useful in scale functions include:
Parameter | Description |
---|---|
name | Label or legend name |
breaks | Vector of break points |
minor_breaks | Vector of minor break points |
labels | Labels to use for each break |
limits | Limits to the range of the axis |
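The labels and limits parameters are not used in the examples above, so here is a minimal sketch of both. The games data frame is made up for illustration, and the "2 games" style labels are just one possible choice.

```r
library(ggplot2)

# Hypothetical stand-in data: minutes played and passes made
games <- data.frame(Time = c(90, 180, 270, 360, 450, 540),
                    Passes = c(20, 55, 80, 130, 160, 210))

scale_plot <- ggplot(games, aes(x = Time, y = Passes)) +
  geom_point() +
  scale_x_continuous(name = "Time played",
                     breaks = 90 * c(2, 4, 6),                      # where ticks go
                     labels = c("2 games", "4 games", "6 games"),   # one label per break
                     limits = c(0, 600))                            # range of the axis

print(scale_plot)
```

The labels vector needs one entry per value in breaks, in the same order.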
For dates, you can use scale functions like scale_x_date and scale_x_datetime. For example, here's a plot of deaths in Chicago in July 1995 using default values for the x-axis:
ggplot(chic_july, aes(x = date, y = death)) +
  geom_line()
And here's an example of changing the formatting and name of the x-axis:
ggplot(chic_july, aes(x = date, y = death)) +
  geom_line() +
  scale_x_date(name = "Date in July 1995", date_labels = "%m-%d")
You can also use the scale functions to transform an axis. For example, to show the Chicago plot with "deaths" on a log scale, you can run:
ggplot(chic_july, aes(x = date, y = death)) +
  geom_line() +
  scale_y_log10()
For colors and fills, the conventions for the names of the scale functions can vary. For example, to adjust the color scale when you're mapping a discrete variable (i.e., categorical, like gender or animal breed) to color, you'd use scale_color_hue. To adjust the color scale for a continuous variable, like age, you'll use scale_color_gradient.
For any color scales, consider starting with brewer first (e.g., scale_color_brewer, scale_color_distiller). Scale functions from brewer allow you to set colors using different palettes. You can explore these palettes at http://colorbrewer2.org/.
The Brewer palettes fall into three categories: sequential, divergent, and qualitative. You should use sequential or divergent for continuous data and qualitative for categorical data. Use display.brewer.pal to show the palette for a given number of colors.
library(RColorBrewer)
display.brewer.pal(name = "Set1", n = 8)
display.brewer.pal(name = "PRGn", n = 8)
display.brewer.pal(name = "PuBuGn", n = 8)
Use the palette argument within a scale function to customize the palette:
a <- ggplot(data.frame(x = 1:5, y = rnorm(5), group = letters[1:5]),
            aes(x = x, y = y, color = group)) +
  geom_point()
b <- a + scale_color_brewer(palette = "Set1")
c <- a + scale_color_brewer(palette = "Pastel2") + theme_dark()
grid.arrange(a, b, c, ncol = 3)
ggplot(worldcup, aes(x = Time, y = Passes, color = Position, size = Shots)) +
  geom_point(alpha = 0.5) +
  scale_color_brewer(palette = "Dark2", name = "Player position")
You can also set colors manually:
ggplot(worldcup, aes(x = Time, y = Passes, color = Position, size = Shots)) +
  geom_point(alpha = 0.5) +
  scale_color_manual(values = c("blue", "red", "darkgreen", "darkgray"))
To find out more
Some excellent further references for plotting are:
- R Graphics Cookbook (book and website)
- Google images
For more technical details about plotting in R:
- ggplot2: Elegant Graphics for Data Analysis, Hadley Wickham
- R Graphics, Paul Murrell
In-course exercise–Chapter 4
Designing a plot
For today's exercise, you'll be building a plot using the worldcup data from the faraway package. First, load in that data. The name of each player is in the rownames of this data. Use the tibble::rownames_to_column() function to move those rownames into a new column named Player. Also install and load the ggplot2 and ggthemes packages.
Next, say you want to look at the relationship between the number of minutes that a player played in the 2010 World Cup (Time) and the number of shots the player took on goal (Shots). On a sheet of paper, and talking with your partner, decide how the two of you would design a plot to explore and present this relationship. How would you incorporate some of the principles of creating good graphs?
Example R code
For this section, the only code needed is code to load the required packages, load the data, and move the rownames to a column named Player.
library(faraway)
data(worldcup)
head(worldcup, 2)
##           Team   Position Time Shots Passes Tackles Saves
## Abdoun Algeria Midfielder   16     0      6       0     0
## Abe      Japan Midfielder  351     0    101      14     0
This dataset has the players' names as rownames, rather than in a column. Once we start using dplyr functions, we'll lose these rownames. Therefore, start by converting the rownames to a column called Player:
library(dplyr)
worldcup <- worldcup %>%
  tibble::rownames_to_column(var = "Player")
head(worldcup, 2)
##   Player    Team   Position Time Shots Passes Tackles Saves
## 1 Abdoun Algeria Midfielder   16     0      6       0     0
## 2    Abe   Japan Midfielder  351     0    101      14     0
Install and load the ggplot2 and ggthemes packages:
# install.packages("ggplot2")
library(ggplot2)
# install.packages("ggthemes")
library(ggthemes)
Implementing plot guidelines #1
In this section, we'll work on creating a plot like this:
Do the following tasks:
- Create a simple scatterplot of Time versus Shots for the World Cup data. It should look like this:
- Next, before any more coding, talk with your group members about how the ultimate graph we're working on (the one printed at the beginning of this exercise section) is different from the simple one you created with ggplot for the last bullet point. Also discuss what you can figure out from this new graph that was less clear from a simpler scatterplot of Time versus Shots for this data.
- Often, in graphs with a lot of points, it's difficult to see some of the points, because they overlap other points. Two strategies to address this are: (a) make the points smaller; and (b) make the points somewhat transparent. Try doing both with the scatterplot you're creating. At this point, the plot should look something like this:
- Create a new column in the worldcup data called top_4 that specifies whether or not the Team for that observation was one of the top four teams in the tournament (Netherlands, Uruguay, Spain, and Germany). Make the colors of the points correspond to whether the team was a top-four team. At this point, the plot should look something like this:
- Increase data density: Try changing the theme, to come up with a graph with a bit less non-data ink. From the ggthemes package, try some of the following themes: theme_few(), theme_tufte(), theme_stata(), theme_fivethirtyeight(), theme_economist_white(), and theme_wsj(). Pick a theme that helps increase the graph's data density. At this point, the plot should look something like this:
- Use meaningful labels: Use the labs() function to make a clearer title for the x-axis. (You may have already written this code in the last section of this exercise.) In addition to setting the x-axis title with the labs function, you can also set the title for the color scale (use color = within the labs function). You may want to make a line break in the color title; you can use the linebreak character (\n) within the character string with the title to do that. At this point, the plot should look something like this:
- Provide useful references: The standard time for a soccer game is 90 minutes. In the World Cup, all teams play at least three games, and then the top teams continue and play more games. Add a reference line at 270 minutes (i.e., the amount of standard time played for the three games that all teams play). At this point, the plot should look something like this:
Example R code
As a reminder, here's the code to do a simple scatterplot of Shots by Time for the worldcup data:
ggplot(data = worldcup) +
  geom_point(mapping = aes(x = Time, y = Shots))
Next, try to make it clearer to see the points by making them smaller and somewhat transparent. This can be done with the size and alpha aesthetics for geom_point. For the size aesthetic, a value smaller than about 2 is smaller than the default, and a value larger than about 2 is larger than the default. For the alpha aesthetic, values closer to 0 are more transparent, and values closer to 1 are more opaque. As a reminder, in this case you are changing all of the points in the same way, so you will be setting those aesthetics to constant values. That means that you should specify the values outside of an aes call. This code could make these changes:
ggplot(data = worldcup) +
  geom_point(mapping = aes(x = Time, y = Shots),
             size = 1, alpha = 0.5)
To create a new column called top_4, first create a vector that lists those top four teams, then create a logical vector in the dataframe for whether the team for that observation is in one of the top four teams:
worldcup <- worldcup %>%
  mutate(top_4 = Team %in% c("Spain", "Germany", "Uruguay", "Netherlands"))
head(worldcup)
##               Team   Position Time Shots Passes Tackles Saves     Player top_4
## Abdoun     Algeria Midfielder   16     0      6       0     0     Abdoun FALSE
## Abe          Japan Midfielder  351     0    101      14     0        Abe FALSE
## Abidal      France   Defender  180     0     91       6     0     Abidal FALSE
## Abou Diaby  France Midfielder  270     1    111       5     0 Abou Diaby FALSE
## Aboubakar Cameroon    Forward   46     2     16       0     0  Aboubakar FALSE
## Abreu      Uruguay    Forward   72     0     15       0     0      Abreu  TRUE
##    Mode   FALSE    TRUE
## logical     517      78
To color points by this variable, use color = in the aes() part of the ggplot() call:
ggplot(data = worldcup) +
  geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
             size = 1, alpha = 0.5)
To increase the data density, try out different themes for the plot. First, I'll save everything we've done so far as the object shot_plot, then I'll try adding different themes:
shot_plot <- ggplot(data = worldcup) +
  geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
             size = 1, alpha = 0.5)
shot_plot + theme_few()
shot_plot + theme_tufte()
shot_plot + theme_fivethirtyeight()
shot_plot + theme_stata()
shot_plot + theme_economist_white()
The data density is increased with the theme_few() theme, so I'll use that:
ggplot(data = worldcup) +
  geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
             size = 1, alpha = 0.5) +
  theme_few()
To change the titles for some of the scales (the x-axis and color scale), you can use the labs() function. Note that you can use \n to add a line break inside one of these titles (I've done that for the title for the color scale):
ggplot(data = worldcup) +
  geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
             size = 1, alpha = 0.5) +
  theme_few() +
  labs(x = "Time played in World Cup (minutes)",
       color = "Team's final \n ranking")
As an extra note, if you want to create nicer labels for the legend for color, convert the top_4 column into the factor class, with the labels you want to use in the figure legend:
worldcup <- worldcup %>%
  mutate(top_4 = factor(top_4, levels = c(TRUE, FALSE),
                        labels = c("Top 4", "Other")))
summary(worldcup$top_4)
## Top 4 Other
##    78   517
ggplot(data = worldcup) +
  geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
             size = 1, alpha = 0.5) +
  theme_few() +
  labs(x = "Time played in World Cup (minutes)",
       color = "Team's final \n ranking")
To add a reference line at 270 minutes of time, use the geom_vline() function. You'll want to make it a light color (like light gray) and dashed or dotted (linetype of 2 or 3), so it won't be too prominent on the graph:
ggplot(data = worldcup) +
  geom_vline(xintercept = 270, color = "lightgray", linetype = 2) +
  geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
             size = 1, alpha = 0.5) +
  theme_few() +
  labs(x = "Time played in World Cup (minutes)",
       color = "Team's final \n ranking")
Implementing plot guidelines #2
- Highlighting interesting data: Who had the most shots in the 2010 World Cup? Was he on a top-four team? Use geom_text() to label his point on the graph with his name (try out some different values of hjust and vjust in this function call to get the label in a place you like). At this point, the plot should look something like this:
- For labeling the player with the top number of shots, instead of only using the player's name, use the following format: "[Player's name], [Player's team]." (Hint: You may want to use mutate to add a new column, where you use paste0 to paste together the player's name, ", ", and the team name.) At this point, the plot should look something like this:
- Create small multiples. The relationship between time played and shots taken is probably different by the players' positions. Use faceting to create different graphs for each position. At this point, the plot should look something like this:
- Make order meaningful: What order are the faceted graphs currently in? Offensive players have more chances to take shots than defensive players, so that might be a useful ordering for the facets. Re-order the Position factor column to go from nearest your own goal to nearest the opponent's goal, then re-plot the graph from the previous step.
Example R code
To add a text label with just the player with the most shots, you'll want to create a new dataframe with only the top player. You can use the top_n function to do that (the wt option is specifying that we want the top player in terms of values in the Shots column):
top_player <- worldcup %>%
  top_n(n = 1, wt = Shots)
Now you can use geom_text() to label this player's point on the graph with his name. You may need to mess around with some of the options in geom_text(), like size, hjust, and vjust (hjust and vjust say where, in relation to the point location, to put the label), to get something you're happy with.
worldcup %>%
  mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
  ggplot(aes(x = Time, y = Shots, color = top_4)) +
  geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
  geom_point(alpha = 0.5, size = 1) +
  geom_text(data = top_player, aes(label = Player, color = NULL),
            hjust = 1.2, vjust = 0.4) +
  labs(x = "Time played in World Cup (minutes)",
       color = "Team's final \n ranking") +
  theme_few()
If you want to put both the player's name and his team, you can add a mutate() function when you create the new dataframe with just the top player, then use this for the label:
top_player <- worldcup %>%
  top_n(n = 1, wt = Shots) %>%
  mutate(label = paste0(Player, ", ", Team))
worldcup %>%
  mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
  ggplot(aes(x = Time, y = Shots, color = top_4)) +
  geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
  geom_point(alpha = 0.5, size = 1) +
  geom_text(data = top_player, aes(label = label, color = NULL),
            hjust = 1.1, vjust = 0.4) +
  labs(x = "Time played in World Cup (minutes)",
       color = "Team's final \n ranking") +
  theme_few()
To create small multiples, use the facet_wrap() command (you'll probably want to use ncol to specify to use four columns):
top_player <- worldcup %>%
  top_n(n = 1, wt = Shots) %>%
  mutate(label = paste0(Player, ", ", Team))
worldcup %>%
  mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
  ggplot(aes(x = Time, y = Shots, color = top_4)) +
  geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
  geom_point(alpha = 0.5, size = 1) +
  geom_text(data = top_player, aes(label = label, color = NULL),
            hjust = 1.1, vjust = 0.4) +
  labs(x = "Time played in World Cup (minutes)",
       color = "Team's final \n ranking") +
  theme_few() +
  facet_wrap(~ Position, ncol = 4)
To re-order the Position column of the dataframe, add a mutate statement before you pipe into the plotting code. Use the levels option of the factor() function: whatever order you put the factors in for this argument will be the new order in which R saves the levels of this factor.
worldcup %>%
  mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany"),
         Position = factor(Position,
                           levels = c("Goalkeeper", "Defender", "Midfielder", "Forward"))) %>%
  ggplot() +
  geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
  geom_point(aes(x = Time, y = Shots, color = top_4), alpha = 0.5, size = 1) +
  geom_text(data = top_player, aes(x = Time, y = Shots, label = label),
            hjust = 1.1, vjust = 0.4) +
  labs(x = "Time played in World Cup (minutes)",
       color = "Team's final \n ranking") +
  theme_few() +
  facet_wrap(~ Position, ncol = 4)
Note from this code example that you can use the levels function to find out the levels and their order for a factor-class vector.
worldcup <- worldcup %>%
  mutate(Position = factor(Position,
                           levels = c("Goalkeeper", "Defender",
                                      "Midfielder", "Forward")))
levels(worldcup$Position)
## [1] "Goalkeeper" "Defender" "Midfielder" "Forward"
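The effect of the levels argument is easy to see on a small made-up vector. The sketch below uses only base R; the positions vector is a hypothetical example, not the worldcup data.

```r
# Hypothetical toy vector of positions (not the worldcup data)
positions <- c("Forward", "Goalkeeper", "Defender", "Midfielder", "Defender")

# By default, factor() puts the levels in alphabetical order
default_order <- factor(positions)
levels(default_order)

# The levels argument sets the order explicitly
field_order <- factor(positions,
                      levels = c("Goalkeeper", "Defender",
                                 "Midfielder", "Forward"))
levels(field_order)

# The values themselves are unchanged; only the level order differs
identical(as.character(default_order), as.character(field_order))
```

Because facets are drawn in level order, changing the levels before plotting is what changes the order of the panels.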
Data visualization cheatsheet
RStudio comes with some excellent cheatsheets, which provide quick references to functions and code you might find useful for different tasks. For this part of the group exercise, you'll explore their cheatsheet for data visualization, both to learn some new ggplot2 code and to get familiar with how to use this cheatsheet as you do your own analysis.
- Open the data visualization cheatsheet. You can do this from RStudio by going to "Help" -> "Cheatsheets" -> "Data Visualization with ggplot2."
- Notice that different sections give examples with some datasets that come with either base R or ggplot2. For example, under the "Graphical Primitives" section, there is code defining the object a as a ggplot object using the "seals" dataset: a <- ggplot(seals, aes(x = long, y = lat)).
- Go through the cheatsheet and list all of the example datasets that are used in this cheatsheet. Open their helpfiles to learn more about the data.
- Create the example objects a through l and s through t using the code given on the cheatsheet.
- Pick at least one example to try out from each of the following sections: "Graphical Primitives," "One Variable," at least three subsections of "Two Variables," "Three Variables," "Scales," "Faceting," and "Position Adjustments." As you try these, try to figure out any aesthetics that you aren't familiar with (e.g., ymin, ymax). Also, use helpfiles for the geoms to look up parameters you aren't familiar with (e.g., stat for geom_area). If you can't figure out how to interpret a plot, check the helpfile for the associated geom. Note: For the n object used in "Scales," it should be defined as n <- d + geom_bar(aes(fill = fl)).
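Besides opening a helpfile, one quick way to see which parameters a geom accepts is to list its arguments from the console. This is a minimal sketch, assuming ggplot2 is installed; area_args is just an illustrative name.

```r
library(ggplot2)

# In an interactive session, this opens the helpfile for the geom:
# ?geom_area

# List the names of the arguments that geom_area() accepts
area_args <- names(formals(geom_area))
area_args

# Check whether a particular parameter, such as stat, is among them
"stat" %in% area_args
```

The argument list only tells you a parameter's name and default; the helpfile is still the place to find out what it does.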
Example R code
The code for opening the helpfiles for the example datasets is:
?seals
?economics
?mpg
?diamonds
?USArrests
Note that, for USArrests, only some of the columns are pulled out (e.g., murder = USArrests$Murder) to use in the data example dataframe. Further, the "Visualizing error" examples use a dataframe created specifically for these examples, called df.
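To see what is kept from USArrests and what is left behind, it can help to compare the full dataset with the smaller dataframe. This sketch uses only base R; arrests_example is an illustrative name chosen here to avoid overwriting anything called data.

```r
# USArrests ships with base R (in the datasets package)
names(USArrests)           # all of the available columns
head(rownames(USArrests))  # state names are stored as row names

# Keep only the murder rate, and move the state names
# into a lowercase column
arrests_example <- data.frame(murder = USArrests$Murder,
                              state = tolower(rownames(USArrests)))
head(arrests_example)
```

The lowercase state names are presumably there so they match the lowercase region names in the map data used with geom_map.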
Some of the base R and ggplot2 example datasets have become fairly well-known. Some that you'll see very frequently in examples are the iris, mpg, and diamonds datasets.
All of the code to create the objects a through l and s through t is given somewhere on the cheatsheet. Here it is in full:
a <- ggplot(seals, aes(x = long, y = lat))
b <- ggplot(economics, aes(date, unemploy))
c <- ggplot(mpg, aes(hwy))
d <- ggplot(mpg, aes(fl))
e <- ggplot(mpg, aes(cty, hwy))
f <- ggplot(mpg, aes(class, hwy))
g <- ggplot(diamonds, aes(cut, color))
h <- ggplot(diamonds, aes(carat, price))
i <- ggplot(economics, aes(date, unemploy))

# Dataframe used for the "Visualizing error" examples
df <- data.frame(grp = c("A", "B"), fit = 4:5, se = 1:2)
j <- ggplot(df, aes(grp, fit, ymin = fit - se, ymax = fit + se))

# Dataframe and map used for the map example
# (map_data() requires the maps package)
data <- data.frame(murder = USArrests$Murder,
                   state = tolower(rownames(USArrests)))
map <- map_data("state")
k <- ggplot(data, aes(fill = murder))

seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2))
l <- ggplot(seals, aes(long, lat))
s <- ggplot(mpg, aes(fl, fill = drv))
t <- ggplot(mpg, aes(cty, hwy)) + geom_point()
Notice that, in some places, the aesthetics are defined using the full aesthetic name-value pair (e.g., aes(x = long, y = lat)), while in other places the code relies on position for defining which column of a dataframe maps to which aesthetic (e.g., aes(cty, hwy) or aes(fl)). Either is fine, although relying on position can result in errors if you are not very familiar with the order in which parameters are defined for a function.
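The difference between named and positional aesthetics can be checked directly by building the mappings without a plot. This is a minimal sketch, assuming ggplot2 is installed; the object names are illustrative.

```r
library(ggplot2)

# The same mapping, written with names and by position
named <- aes(x = cty, y = hwy)
positional <- aes(cty, hwy)

# Both mappings end up with the same aesthetic names
names(named)
names(positional)

# Reversing the positional order silently swaps the axes;
# printing the mapping shows which column goes where
swapped <- aes(hwy, cty)
swapped
```

Nothing warns you about the swapped version, which is why spelling out the names is the safer habit when you are unsure of the argument order.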
This code will vary based on the examples you try, but here is some code for one set of examples:
b + geom_ribbon(aes(ymin = unemploy - 900, ymax = unemploy + 900))
c + geom_dotplot()
f + geom_violin(scale = "area")
h + geom_hex()  # requires the hexbin package
j + geom_pointrange()
k + geom_map(aes(map_id = state), map = map) +
  expand_limits(x = map$long, y = map$lat)
l + geom_contour(aes(z = z))

n <- d + geom_bar(aes(fill = fl))
n + scale_fill_brewer(palette = "Blues")

# `..x..` is the older notation from the cheatsheet;
# recent versions of ggplot2 prefer after_stat(x)
o <- c + geom_dotplot(aes(fill = ..x..))
o + scale_fill_gradient(low = "red", high = "yellow")

t + facet_grid(year ~ fl)
s + geom_bar(position = "fill")
Source: https://geanders.github.io/RProgrammingForResearch/reporting-data-results-1.html