Reporting data results #1

The video lectures for this chapter are embedded at relevant places in the text, with links to download a pdf of the associated slides for each video. You can also access a full playlist for the videos for this chapter.

Guidelines for good plots

Download a pdf of the lecture slides for this video.

There are a number of very thoughtful books and articles about creating graphics that effectively communicate data. Some of the authors I highly recommend (and from whose work I've pulled the guidelines for good graphics we'll talk about this week) are:

  • Edward Tufte
  • Howard Wainer
  • Stephen Few
  • Nathan Yau

You should plan, in particular, to read The Visual Display of Quantitative Information by Edward Tufte before you graduate.

This week, we'll focus on six guidelines for good graphics, based on the writings of these and other specialists in data display. The guidelines are:

  1. Aim for high data density.
  2. Use clear, meaningful labels.
  3. Provide useful references.
  4. Highlight interesting aspects of the data.
  5. Make order meaningful.
  6. When possible, use small multiples.

For the examples, I'll use dplyr for data cleaning and, for plotting, the packages ggplot2, gridExtra, and ggthemes.

    library(tidyverse)  ## Loads `dplyr` and `ggplot2`
    library(gridExtra)
    library(ggthemes)

You can load the data for today's examples with the following code:

    library(faraway)
    data(nepali)
    data(worldcup)

    library(dlnm)
    data(chicagoNMMAPS)
    chic <- chicagoNMMAPS
    chic_july <- chic %>%
      filter(month == 7 & year == 1995)

High data density

Guideline 1: Aim for high data density.

Download a pdf of the lecture slides for this video.

You should try to increase, as much as possible, the data-to-ink ratio in your graphs. This is the ratio of "ink" providing information to all ink used in the figure. One way to think about this is that the only graphs you make that use up a lot of your printer's ink should be packed with information.

The two graphs in Figure 4.1 show the same information, but use very different amounts of ink. Each shows the number of players in each of four positions in the worldcup dataset. Notice how, in the plot on the right, a single dot for each category shows the same data that a whole filled bar is showing on the left. Further, the plot on the right has removed the gridded background, removing even more "ink."

Figure 4.1: Example of plots with lower (left) and higher (right) data-to-ink ratios. Each plot shows the number of players in each position in the worldcup dataset from the faraway package.

Figure 4.2 gives another example of two plots that show the same information but with very different data densities. This figure uses the chicagoNMMAPS data from the dlnm package, which includes daily mortality, weather, and air pollution data for Chicago, IL. Both plots show daily mortality counts during July 1995, when a very severe heat wave hit Chicago. Notice how many of the elements in the plot on the left, including the shading under the mortality time series and the colored background and grid lines, are unnecessary for interpreting the message from the data.

Figure 4.2: Example of plots with lower (left) and higher (right) data-to-ink ratios. Each plot shows daily mortality in Chicago, IL, in July 1995 using the chicagoNMMAPS data from the dlnm package.

By increasing the data-to-ink ratio in a plot, you can help viewers see the message of the data more quickly. A cluttered plot is harder to interpret. Further, you leave room to add some of the other elements I'll talk about, including highlighting interesting data and adding useful references. Notice how the plots on the left in Figures 4.1 and 4.2 are already cluttered and leave little room for adding extra elements, while the plots on the right of those figures have much more room for additions.
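
As a rough sketch of the contrast described above (not the code behind Figure 4.1), the following builds a bar chart and a dot plot from the same counts. The data frame `position_counts` and its numbers are made up for illustration, and `theme_classic()` is just one possible way to drop the background grid.

```r
library(ggplot2)

# Made-up counts for illustration (not the actual worldcup totals)
position_counts <- data.frame(
  position = c("Defender", "Forward", "Goalkeeper", "Midfielder"),
  n_players = c(190, 140, 35, 230)
)

# Lower data-to-ink ratio: filled bars on the default gray background
bar_plot <- ggplot(position_counts, aes(x = position, y = n_players)) +
  geom_col()

# Higher data-to-ink ratio: one point per category, no background grid
dot_plot <- ggplot(position_counts, aes(x = n_players, y = position)) +
  geom_point() +
  theme_classic()
```

Printing `bar_plot` and `dot_plot` side by side shows the same four values with very different amounts of ink.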

One quick way to increase data density in ggplot2 is to change the theme for the plot. The theme specifies a number of the "background" elements to a plot, including elements like the plot grid, background color, and the font used for labeling. Some themes come with ggplot2, including:

  • theme_bw
  • theme_minimal
  • theme_void

You can find more themes in packages that extend ggplot2. The ggthemes package, in particular, has some excellent additional themes.

Figure 4.3 shows some examples of the effects of using different themes. All show the same information: a plot of daily deaths in Chicago in July 1995. The top left graph shows the graph with the default theme. The other plots show the effects of adding different themes, including the black-and-white theme that comes with ggplot2 (top right) and various themes from the ggthemes package. You can even use themes to add some questionable choices for different elements, like the Excel theme (bottom left).

Figure 4.3: Daily mortality in Chicago, IL, in July 1995. This figure gives an example of the plot using different themes.
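
Here is a minimal sketch of how a theme is added to a plot. It is not the code behind Figure 4.3: the `deaths` data frame is a small made-up series standing in for the Chicago data, and only themes that ship with ggplot2 are used.

```r
library(ggplot2)

# Made-up daily death counts standing in for the Chicago series
deaths <- data.frame(
  day = 1:10,
  death = c(100, 105, 98, 110, 150, 210, 180, 130, 112, 101)
)

base_plot <- ggplot(deaths, aes(x = day, y = death)) +
  geom_line()

plot_default <- base_plot                    # default theme (theme_gray)
plot_bw      <- base_plot + theme_bw()       # white background, gray grid
plot_minimal <- base_plot + theme_minimal()  # no panel border or background
```

Themes from ggthemes (for example, `theme_few()`) are added to a plot in the same way once that package is loaded.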

Meaningful labels

Guideline 2: Use clear, meaningful labels.

Download a pdf of the lecture slides for this video.

Graphs often default to using abbreviations for axis labels and other labeling. For example, the default is for ggplot2 plots to use column names for the x- and y-axes of a scatterplot. While this is convenient for exploratory plots, it's often not adequate for plots for presentations and papers. You'll want to use short and easy-to-type column names in your dataframe to make coding easier, but you should use longer and more meaningful labeling in plots and tables that others need to interpret.

Furthermore, text labels can sometimes be aligned in a way that makes them difficult to read. For example, when plotting a categorical variable along the x-axis, it can be hard to fit labels for each category that are long enough to be meaningful.

Figure 4.4 gives an example of the same information shown with labels that are harder to interpret (left) versus with clear, meaningful labels (right). Notice how the graph on the left is using abbreviations for the categorical variable ("DF" for "Defense"), abbreviations for axis labels ("Pos" for "Position" and "Pls" for "Number of players"), and has the player position labels in a vertical alignment. On the right graph, I have made the graph easier to quickly read and interpret by spelling out all labels and switching the x- and y-axes, so that there's room to fully spell out each position while still keeping the alignment horizontal, so the reader doesn't have to turn the page (or their head) to read the values.

Figure 4.4: The number of players in each position in the worldcup data from the faraway package. Both graphs show the same information, but the left graph has murkier labels, while the right graph has labels that are easier to read and interpret.

There are a few strategies you can use to make labels clearer when plotting with ggplot2:

  • Add xlab and ylab elements to the plot, rather than relying on the column names in the original data. You can also relabel x- and y-axes with scale elements (e.g., scale_x_continuous), and the scale functions give you more power to also make other changes to the x- and y-axes (e.g., changing break points for the axis ticks). However, if you just need to change axis labels, xlab and ylab are often quicker.
  • Include units of measurement in axis titles when relevant. If units are dollars or percent, check out the scales package, which allows you to add labels directly to axis elements by including arguments like labels = percent in scale elements. See the helpfile for scale_x_continuous for some examples.
  • If the x-variable requires longer labels, as is frequently the case with categorical data (for example, player positions in Figure 4.4), consider flipping the coordinates, rather than abbreviating or rotating the labels. You can use coord_flip to do this.
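
The three strategies above can be combined in one plot. The sketch below assumes a made-up data frame, `pos_share`, with a short column name (`pos`) and a proportion (`share`); it is an illustration, not the code behind Figure 4.4.

```r
library(ggplot2)
library(scales)

# Made-up data: share of players at each position
pos_share <- data.frame(
  pos = c("Defender", "Forward", "Goalkeeper", "Midfielder"),
  share = c(0.32, 0.24, 0.06, 0.38)
)

label_plot <- ggplot(pos_share, aes(x = pos, y = share)) +
  geom_point() +
  xlab("Player position") +                  # full axis title, not "pos"
  scale_y_continuous(name = "Share of all players",
                     labels = percent) +     # format ticks as percentages
  coord_flip()                               # categories on the vertical axis
```

Because of `coord_flip()`, the position names end up on the vertical axis, where they can be spelled out horizontally.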

References

Guideline 3: Provide useful references.

Data is easier to interpret when you add references. For example, if you show what is typical, it helps viewers interpret how unusual outliers are.

Figure 4.5 shows daily mortality during July 1995 in Chicago, IL. The graph on the right has added shading showing the range of daily death counts in July in Chicago for neighboring years (1990–1994 and 1996–2000). This added reference helps clarify for viewers how unusual the number of deaths during the July 1995 heat wave was.

Figure 4.5: Daily mortality during July 1995 in Chicago, IL. In the graph on the right, I have added a shaded region showing the range of daily mortality counts for neighboring years, to show how unusual this event was.

Another useful way to add references is to add a linear or smooth fit to the data, to help clarify trends in the data. Figure 4.6 shows the relationship between passes and shots for Forwards in the worldcup dataset. The plot on the right has added a smooth function of the relationship between these two variables.

Figure 4.6: Relationship between passes and shots taken among Forwards in the worldcup dataset from the faraway package. The plot on the right has a smooth function added to help show the relationship between these two variables.

For scatterplots created with ggplot2, you can use the function geom_smooth to add a smooth or linear reference line. Here is the code that produces Figure 4.6:

    ggplot(filter(worldcup, Position == "Forward"),
           aes(x = Passes, y = Shots)) +
      geom_point(size = 1.5) +
      theme_few() +
      geom_smooth()

The most useful geom_smooth parameters to know are:

  • method: The default is to add a loess curve if the data includes fewer than 1000 points and a generalized additive model for 1000 points or more. However, you can change to show the fitted line from a linear model using method = "lm" or from a generalized linear model using method = "glm".
  • span: How wiggly or smooth the smooth line should be (smaller value: more wiggly; larger value: more smooth)
  • se: TRUE or FALSE, indicating whether to include shading for 95% confidence intervals.
  • level: Confidence level for confidence interval (e.g., 0.90 for 90% confidence intervals)
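
To see several of these parameters together, here is a small sketch that swaps the default smoother for a linear fit with a 90% confidence band. The `players` data frame is simulated for illustration; it is not the worldcup data.

```r
library(ggplot2)

set.seed(1)
# Simulated passes and shots (made-up values)
players <- data.frame(passes = runif(60, min = 0, max = 200))
players$shots <- 2 + 0.05 * players$passes + rnorm(60, sd = 2)

smooth_plot <- ggplot(players, aes(x = passes, y = shots)) +
  geom_point(size = 1.5) +
  geom_smooth(method = "lm",   # straight-line fit instead of loess
              se = TRUE,       # shade the confidence band
              level = 0.90)    # 90% rather than the default 95%
```

Note that span only matters for the loess smoother, so it is left out here.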

Lines and polygons can also be useful for adding references, as in Figure 4.5. Useful geoms for such shapes include:

  • geom_hline, geom_vline: Add a horizontal or vertical line
  • geom_abline: Add a line with an intercept and slope
  • geom_polygon: Add a filled polygon
  • geom_path: Add an unfilled polygon

You want these references to support the main data shown in the plot, but not overwhelm it. When adding these references:

  • Add reference elements first, so they will be plotted under the data, instead of on top of it.
  • Use alpha to add transparency to these elements.
  • Use colors that are unobtrusive (e.g., grays).
  • For lines, consider using non-solid line types (e.g., linetype = 3).
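
The tips above can be sketched with a single horizontal reference line. This example uses a made-up `deaths` series and takes the median as the "typical" value; both are assumptions for illustration and not how Figure 4.5 was built.

```r
library(ggplot2)

# Made-up daily death counts
deaths <- data.frame(
  day = 1:10,
  death = c(100, 105, 98, 110, 150, 210, 180, 130, 112, 101)
)
typical <- median(deaths$death)   # reference value computed from the data

ref_plot <- ggplot(deaths, aes(x = day, y = death)) +
  geom_hline(yintercept = typical,       # reference layer is added first
             color = "gray40", alpha = 0.6, linetype = 3) +
  geom_line()                            # data is drawn on top of it
```

Because geom_hline comes before geom_line, the dotted gray line sits underneath the time series.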

Highlighting

Guideline 4: Highlight interesting aspects.

Download a pdf of the lecture slides for this video.

Consider adding elements to highlight noteworthy elements of the data. For example, in the graph on the right of Figure 4.7, the days of the heat wave (based on temperature measurements) have been highlighted over the mortality time series by using a thick red line.

Figure 4.7: Mortality in Chicago, July 1995. In the plot on the right, a thick red line has been added to show the dates of a heat wave.

In the graphs below, the names of the players with the most shots and passes have been added to highlight these unusual points.

One helpful way to annotate is with text, using geom_text(). For this, you'll first need to create a dataframe with the hottest day in the data:

    hottest_day <- chic_july %>%
      filter(temp == max(temp))
    hottest_day[ , 1:6]
    ##         date time year month doy      dow
    ## 1 1995-07-13 3116 1995     7 194 Thursday
    chic_plot + geom_text(data = hottest_day,
                          label = "Max", size = 3)

With geom_text, you'll often want to adjust the position of the text (here, with the hjust and vjust parameters) so it won't be right on top of the data points:

    chic_plot + geom_text(data = hottest_day,
                          label = "Max", size = 3,
                          hjust = 0, vjust = -1)

You can also use lines to highlight. For this, it is often useful to create a new dataframe with data for the reference. To add a line for the Chicago heat wave, I've added a dataframe called hw with the relevant date range. I'm setting the y-value to be high enough (425) to ensure the line will be placed above the mortality data.

    hw <- data.frame(date = c(as.Date("1995-07-12"),
                              as.Date("1995-07-16")),
                     death = c(425, 425))

    b <- chic_plot +
      geom_line(data = hw,
                aes(x = date, y = death),
                size = 2)

Order

Guideline 5: Make order meaningful.

Download a pdf of the lecture slides for this video.

You can make the ranking of data clearer from a graph by using order to show rank. Often, factor or categorical variables are ordered by something that is not interesting, like alphabetical order.

You can re-order factor variables in a graph by resetting the factor using the factor function and changing the order that levels are included in the levels parameter.
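
As a small sketch of this idea, the code below re-orders a categorical variable by a count rather than alphabetically. The `position_counts` data frame and its numbers are made up for illustration.

```r
# Made-up counts for illustration
position_counts <- data.frame(
  position = c("Defender", "Forward", "Goalkeeper", "Midfielder"),
  n_players = c(190, 140, 35, 230),
  stringsAsFactors = FALSE
)

# Position names sorted from fewest to most players
ordered_levels <- position_counts$position[order(position_counts$n_players)]

# Reset the factor so its levels follow that ranking
position_counts$position <- factor(position_counts$position,
                                   levels = ordered_levels)

levels(position_counts$position)
```

A plot that maps `position` to an axis will now show the categories in order of the number of players, since ggplot2 follows the order of the factor levels.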

Small multiples

Guideline 6: When possible, use small multiples.

Download a pdf of the lecture slides for this video.

Small multiples are graphs that use many small plots showing the same thing for different facets of the data. For example, instead of using color in a single plot to show data for males and females, you could use two small plots, one each for males and females.

Typically, in small multiples, all plots use the same x- and y-axes. This makes it easier to compare across plots, and it also allows you to save room by limiting axis annotation.

You can use the facet functions to create small multiples. This separates the graph into several small graphs, one for each level of a factor.

The facet functions are:

  • facet_grid()
  • facet_wrap()

For example, to create small multiples by sex for the Nepali dataset, when plotting height versus weight, you can call:

    ggplot(nepali, aes(ht, wt)) +
      geom_point() +
      facet_grid(. ~ sex)

The facet_grid function can facet by one or two variables. One will be shown by rows, and one by columns:

    ## Generic code
    facet_grid([factor for rows] ~ [factor for columns])

The facet_wrap() function can only facet by one variable, but it can "wrap" the small graphs for that variable, so they don't all have to be in one row or column:

    ## Generic code
    facet_wrap(~ [factor for faceting], ncol = [number of columns])
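
To make the generic facet_wrap code concrete, here is a sketch with simulated data. The `kids` data frame and its four groups are made up for illustration and are not part of the nepali dataset.

```r
library(ggplot2)

set.seed(2)
# Simulated heights and weights for four made-up groups
kids <- data.frame(
  group = rep(c("A", "B", "C", "D"), each = 15),
  ht = rnorm(60, mean = 90, sd = 8)
)
kids$wt <- 0.15 * kids$ht + rnorm(60, sd = 1)

wrapped <- ggplot(kids, aes(x = ht, y = wt)) +
  geom_point() +
  facet_wrap(~ group, ncol = 2)   # four panels arranged in a 2 x 2 layout
```

With `ncol = 2`, the four panels wrap onto two rows instead of stretching across a single row.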

Often, when you do faceting, you'll want to re-name your factor levels or re-order them. For this, you'll need to use the factor() function on the original vector. For example, to rename the sex factor levels from "1" and "2" to "Male" and "Female," you can run:

    nepali <- nepali %>%
      mutate(sex = factor(sex, levels = c(1, 2),
                          labels = c("Male", "Female")))

Notice that the labels for the two graphs have now changed:

    ggplot(nepali, aes(ht, wt)) +
      geom_point() +
      facet_grid(. ~ sex)

To re-order the factor, and show the plot for "Female" first, you can use factor to change the order of the levels:

    nepali <- nepali %>%
      mutate(sex = factor(sex, levels = c("Female", "Male")))

Now notice that the order of the plots has changed:

    ggplot(nepali, aes(ht, wt)) +
      geom_point() +
      facet_grid(. ~ sex)

Advanced customization

Scales

There are a number of different functions for adjusting scales. These follow this convention:

    ## Generic code
    scale_[aesthetic]_[vector type]

For example, to adjust the x-axis scale for a continuous variable, you'd use scale_x_continuous. You can use a scale function for an axis to change things like the axis label (which you could also change with xlab or ylab) as well as position and labeling of breaks.

For example, here is the default for plotting time versus passes for the worldcup dataset, with the number of shots taken shown by size and position shown by color:

    ggplot(worldcup, aes(x = Time, y = Passes,
                         color = Position, size = Shots)) +
      geom_point(alpha = 0.5)

    ggplot(worldcup, aes(x = Time, y = Passes,
                         color = Position, size = Shots)) +
      geom_point(alpha = 0.5) +
      scale_x_continuous(name = "Time played (minutes)",
                         breaks = 90 * c(2, 4, 6),
                         minor_breaks = 90 * c(1, 3, 5))

Parameters you might find useful in scale functions include:

    Parameter      Description
    name           Label or legend name
    breaks         Vector of break points
    minor_breaks   Vector of minor break points
    labels         Labels to use for each break
    limits         Limits to the range of the axis
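
The labels and limits parameters aren't used in the example above, so here is a short sketch that sets them along with name and breaks. The `players` data frame is simulated, and the "games" labels are just one possible relabeling of 90-minute breaks.

```r
library(ggplot2)

set.seed(3)
# Simulated playing time (minutes) and passes (made-up values)
players <- data.frame(time = runif(50, min = 0, max = 540))
players$passes <- round(0.4 * players$time + rnorm(50, sd = 15))

scaled_plot <- ggplot(players, aes(x = time, y = passes)) +
  geom_point(alpha = 0.5) +
  scale_x_continuous(name = "Time played",
                     breaks = 90 * c(2, 4, 6),
                     labels = c("2 games", "4 games", "6 games"),
                     limits = c(0, 600))   # axis runs from 0 to 600 minutes
```

The labels vector needs one entry for each value in breaks.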

For dates, you can use scale functions like scale_x_date and scale_x_datetime. For example, here's a plot of deaths in Chicago in July 1995 using default values for the x-axis:

    ggplot(chic_july, aes(x = date, y = death)) +
      geom_line()

And here's an example of changing the formatting and name of the x-axis:

    ggplot(chic_july, aes(x = date, y = death)) +
      geom_line() +
      scale_x_date(name = "Date in July 1995",
                   date_labels = "%m-%d")

You can also use the scale functions to transform an axis. For example, to show the Chicago plot with "deaths" on a log scale, you can run:

    ggplot(chic_july, aes(x = date, y = death)) +
      geom_line() +
      scale_y_log10()

For colors and fills, the conventions for the names of the scale functions can vary. For example, to adjust the color scale when you're mapping a discrete variable (i.e., categorical, like gender or animal breed) to color, you'd use scale_color_hue. To adjust the color scale for a continuous variable, like age, you'll use scale_color_gradient.
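
Here is a quick sketch of those two functions side by side. The `people` data frame is simulated, and the specific color choices (a lower luminance for the hues, a gray-to-blue gradient) are arbitrary examples.

```r
library(ggplot2)

set.seed(4)
# Simulated data with one discrete and one continuous variable
people <- data.frame(
  x = rnorm(40),
  y = rnorm(40),
  group = rep(c("a", "b"), times = 20),   # discrete variable
  age = runif(40, min = 20, max = 70)     # continuous variable
)

discrete_color <- ggplot(people, aes(x = x, y = y, color = group)) +
  geom_point() +
  scale_color_hue(name = "Group", l = 40)   # darker hues than the default

continuous_color <- ggplot(people, aes(x = x, y = y, color = age)) +
  geom_point() +
  scale_color_gradient(name = "Age (years)",
                       low = "gray80", high = "darkblue")
```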

For any color scales, consider starting with brewer first (e.g., scale_color_brewer, scale_color_distiller). Scale functions from brewer allow you to set colors using different palettes. You can explore these palettes at http://colorbrewer2.org/.

The Brewer palettes fall into three categories: sequential, divergent, and qualitative. You should use sequential or divergent for continuous data and qualitative for categorical data. Use display.brewer.pal to show the palette for a given number of colors.

    library(RColorBrewer)
    display.brewer.pal(name = "Set1", n = 8)
    display.brewer.pal(name = "PRGn", n = 8)
    display.brewer.pal(name = "PuBuGn", n = 8)

Use the palette argument within a scales function to customize the palette:

    a <- ggplot(data.frame(x = 1:5, y = rnorm(5),
                           group = letters[1:5]),
                aes(x = x, y = y, color = group)) +
      geom_point()
    b <- a + scale_color_brewer(palette = "Set1")
    c <- a + scale_color_brewer(palette = "Pastel2") +
      theme_dark()
    grid.arrange(a, b, c, ncol = 3)

    ggplot(worldcup, aes(x = Time, y = Passes,
                         color = Position, size = Shots)) +
      geom_point(alpha = 0.5) +
      scale_color_brewer(palette = "Dark2",
                         name = "Player position")

You can also set colors manually:

    ggplot(worldcup, aes(x = Time, y = Passes,
                         color = Position, size = Shots)) +
      geom_point(alpha = 0.5) +
      scale_color_manual(values = c("blue", "red",
                                    "darkgreen", "darkgray"))

To find out more

Some excellent further references for plotting are:

  • R Graphics Cookbook (book and website)
  • Google images

For more technical details about plotting in R:

  • ggplot2: Elegant Graphics for Data Analysis, Hadley Wickham
  • R Graphics, Paul Murrell

In-course exercise–Chapter 4

Designing a plot

For today's exercise, you'll be building a plot using the worldcup data from the faraway package. First, load in that data. The name of each player is in the rownames of this data. Use the tibble::rownames_to_column() function to move those rownames into a new column named Player. Also install and load the ggplot2 and ggthemes packages.

Next, say you want to look at the relationship between the number of minutes that a player played in the 2010 World Cup (Time) and the number of shots the player took on goal (Shots). On a sheet of paper, and talking with your partner, decide how the two of you would design a plot to explore and present this relationship. How would you incorporate some of the principles of creating good graphs?

Example R code

For this section, the only code needed is code to load the required packages, load the data, and move the rownames to a column named Player.

    library(faraway)
    data(worldcup)
    head(worldcup, 2)
    ##           Team   Position Time Shots Passes Tackles Saves
    ## Abdoun Algeria Midfielder   16     0      6       0     0
    ## Abe      Japan Midfielder  351     0    101      14     0

This dataset has the players' names as rownames, rather than in a column. Once we start using dplyr functions, we'll lose these rownames. Therefore, start by converting the rownames to a column called Player:

    library(dplyr)
    worldcup <- worldcup %>%
      tibble::rownames_to_column(var = "Player")
    head(worldcup, 2)
    ##   Player    Team   Position Time Shots Passes Tackles Saves
    ## 1 Abdoun Algeria Midfielder   16     0      6       0     0
    ## 2    Abe   Japan Midfielder  351     0    101      14     0

Install and load the ggplot2 and ggthemes packages:

    # install.packages("ggplot2")
    library(ggplot2)
    # install.packages("ggthemes")
    library(ggthemes)

Implementing plot guidelines #1

In this section, we'll work on creating a plot like this:

Do the following tasks:

  • Create a simple scatterplot of Time versus Shots for the World Cup data. It should look like this:

  • Next, before any more coding, talk with your group members about how the ultimate graph we're working on (the one printed at the beginning of this exercise section) is different from the simple one you created with ggplot for the last bullet point. Also discuss what you can figure out from this new graph that was less clear from a simpler scatterplot of Time versus Shots for this data.
  • Often, in graphs with a lot of points, it's difficult to see some of the points, because they overlap other points. Two strategies to address this are: (a) make the points smaller; and (b) make the points somewhat transparent. Try doing both with the scatterplot you're creating. At this point, the plot should look something like this:

  • Create a new column in the worldcup data called top_four that specifies whether or not the Team for that observation was one of the top four teams in the tournament (Netherlands, Uruguay, Spain, and Germany). Make the colors of the points correspond to whether the team was a top-four team. At this point, the plot should look something like this:

  • Increase data density: Try changing the theme, to come up with a graph with a bit less non-data ink. From the ggthemes package, try some of the following themes: theme_few(), theme_tufte(), theme_stata(), theme_fivethirtyeight(), theme_economist_white(), and theme_wsj(). Pick a theme that helps increase the graph's data density. At this point, the plot should look something like this:

  • Use meaningful labels: Use the labs() function to make a clearer title for the x-axis. (You may have already written this code in the last section of this exercise.) In addition to setting the x-axis title with the labs function, you can also set the title for the color scale (use color = within the labs function). You may want to make a line break in the color title; you can use the linebreak character (\n) within the character string with the title to do that. At this point, the plot should look something like this:

  • Provide useful references: The standard time for a soccer game is 90 minutes. In the World Cup, all teams play at least three games, and then the top teams continue and play more games. Add a reference line at 270 minutes (i.e., the amount of standard time played for the three games that all teams play). At this point, the plot should look something like this:

Example R code

As a reminder, here's the code to do a simple scatterplot of Shots by Time for the worldcup data:

    ggplot(data = worldcup) +
      geom_point(mapping = aes(x = Time, y = Shots))

Next, try to make it easier to see the points by making them smaller and somewhat transparent. This can be done with the size and alpha aesthetics for geom_point. For the size aesthetic, a value smaller than about 2 gives points smaller than the default, and a value larger than about 2 gives points larger than the default. For the alpha aesthetic, values closer to 0 are more transparent and values closer to 1 are more opaque. As a reminder, in this case you are changing all of the points in the same way, so you will be setting those aesthetics to constant values. That means that you should specify the values outside of an aes call. This code could make these changes:

    ggplot(data = worldcup) +
      geom_point(mapping = aes(x = Time, y = Shots),
                 size = 1, alpha = 0.5)
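The difference between setting an aesthetic to a constant and mapping it inside aes() can be checked directly. This is a minimal sketch using a small made-up dataframe (toy, which is not part of the World Cup data): the first plot sets size and alpha outside of aes(), while the second puts a constant inside aes(), where ggplot2 treats it as a data value and builds a scale and legend for it.

```r
library(ggplot2)

# Hypothetical toy data, used only for illustration
toy <- data.frame(x = c(1, 2, 3), y = c(2, 4, 3))

# Setting: constants outside aes() apply to every point
set_plot <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y), size = 1, alpha = 0.5)

# Mapping: a constant inside aes() is treated as data, so ggplot2
# creates a scale (and legend) for it instead of using it directly
map_plot <- ggplot(data = toy) +
  geom_point(mapping = aes(x = x, y = y, size = 1))

# Constants are stored in the layer's parameters;
# mapped aesthetics are stored in the layer's mapping
set_plot$layers[[1]]$aes_params
names(map_plot$layers[[1]]$mapping)
```

If a legend shows up unexpectedly for size or alpha, a constant placed inside aes() is a likely cause.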

To create the new column (named top_4 in the code below), list the top four teams in a vector, then use %in% to create a logical column in the dataframe for whether the team for that observation is one of the top four teams:

    worldcup <- worldcup %>%
      mutate(top_4 = Team %in% c("Spain", "Germany",
                                 "Uruguay", "Netherlands"))
    head(worldcup)

    ##                Team   Position Time Shots Passes Tackles Saves     Player top_4
    ## Abdoun      Algeria Midfielder   16     0      6       0     0     Abdoun FALSE
    ## Abe           Japan Midfielder  351     0    101      14     0        Abe FALSE
    ## Abidal       France   Defender  180     0     91       6     0     Abidal FALSE
    ## Abou Diaby   France Midfielder  270     1    111       5     0 Abou Diaby FALSE
    ## Aboubakar  Cameroon    Forward   46     2     16       0     0  Aboubakar FALSE
    ## Abreu       Uruguay    Forward   72     0     15       0     0      Abreu  TRUE

    summary(worldcup$top_4)

    ##    Mode   FALSE    TRUE
    ## logical     517      78
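If the %in% operator is new to you, it may help to try it on a short vector first. This is a small sketch with made-up vectors (team and top_teams are names I've chosen for illustration), showing that %in% returns one TRUE or FALSE for each element of the left-hand vector:

```r
# Hypothetical vector of team names, for illustration only
team <- c("Algeria", "Spain", "Japan", "Uruguay")
top_teams <- c("Spain", "Germany", "Uruguay", "Netherlands")

# %in% returns one TRUE/FALSE per element of the left-hand vector
is_top <- team %in% top_teams
is_top

# Summing a logical vector counts the TRUE values
sum(is_top)
```

Inside mutate(), the same comparison runs down the whole Team column, which is why the new column comes out as a logical vector.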

To color points by this variable, use color = in the aes() part of the ggplot() call:

    ggplot(data = worldcup) +
      geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
                 size = 1, alpha = 0.5)

To increase the data density, try out different themes for the plot. First, I'll save everything we've done so far as the object shot_plot, then I'll try adding different themes:

    shot_plot <- ggplot(data = worldcup) +
      geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
                 size = 1, alpha = 0.5)

    shot_plot + theme_few()

    shot_plot + theme_tufte()

    shot_plot + theme_fivethirtyeight()

    shot_plot + theme_stata()

    shot_plot + theme_economist_white()

The theme_few() theme increases the data density, so I'll use that:

    ggplot(data = worldcup) +
      geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
                 size = 1, alpha = 0.5) +
      theme_few()

To change the titles for some of the scales (the x-axis and color scale), you can use the labs() function. Note that you can use \n to add a line break inside one of these titles (I've done that for the title for the color scale):

    ggplot(data = worldcup) +
      geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
                 size = 1, alpha = 0.5) +
      theme_few() +
      labs(x = "Time played in World Cup (minutes)",
           color = "Team's final\nranking")

As an extra note, if you want to create nicer labels for the legend for color, convert the top_4 column into the factor class, with the labels you want to use in the figure legend:

    worldcup <- worldcup %>%
      mutate(top_4 = factor(top_4, levels = c(TRUE, FALSE),
                            labels = c("Top 4", "Other")))
    summary(worldcup$top_4)

    ## Top 4 Other
    ##    78   517

    ggplot(data = worldcup) +
      geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
                 size = 1, alpha = 0.5) +
      theme_few() +
      labs(x = "Time played in World Cup (minutes)",
           color = "Team's final\nranking")

To add a reference line at 270 minutes of time, use the geom_vline() function. You'll want to make it a light color (like light gray) and dashed or dotted (linetype of 2 or 3), so it won't be too prominent on the graph:

    ggplot(data = worldcup) +
      geom_vline(xintercept = 270, color = "lightgray", linetype = 2) +
      geom_point(mapping = aes(x = Time, y = Shots, color = top_4),
                 size = 1, alpha = 0.5) +
      theme_few() +
      labs(x = "Time played in World Cup (minutes)",
           color = "Team's final\nranking")

Implementing plot guidelines #2

  • Highlighting interesting data: Who had the most shots in the 2010 World Cup? Was he on a top-four team? Use geom_text() to label his point on the graph with his name (try out some different values of hjust and vjust in this function call to get the label in a place you like). At this point, the plot should look something like this:

  • For labeling the player with the top number of shots, instead of only using the player's name, use the following format: "[Player's name], [Player's team]." (Hint: You may want to use mutate to add a new column, where you use paste0 to paste together the player's name, ", ", and the team name.) At this point, the plot should look something like this:

  • Create small multiples. The relationship between time played and shots taken is probably different by the players' positions. Use faceting to create different graphs for each position. At this point, the plot should look something like this:

  • Make order meaningful: What order are the faceted graphs currently in? Offensive players have more chances to take shots than defensive players, so that might be a useful ordering for the facets. Re-order the Position factor column to go from nearest your own goal to nearest the opponent's goal, then re-plot the graph from the previous step.

Example R code

To add a text label for just the player with the most shots, you'll want to create a new dataframe with only the top player. You can use the top_n function to do that (the wt option specifies that we want the top player in terms of values in the Shots column):

    top_player <- worldcup %>%
      top_n(n = 1, wt = Shots)
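As a side note, in dplyr 1.0.0 and later, top_n() is documented as superseded by slice_max(). The top_n() call above still works, but here is a minimal sketch of the newer form, using a made-up dataframe (toy) in place of the worldcup data:

```r
library(dplyr)

# Hypothetical toy data standing in for the worldcup dataframe
toy <- data.frame(Player = c("A", "B", "C"),
                  Shots = c(3, 12, 7))

# top_n(), as used in this section
top_old <- toy %>%
  top_n(n = 1, wt = Shots)

# slice_max() is the newer equivalent; like top_n(), it keeps
# tied rows by default (with_ties = TRUE)
top_new <- toy %>%
  slice_max(order_by = Shots, n = 1)

top_new$Player
```

Because ties are kept, either function can return more than one row if two players share the highest number of shots.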

Now you can use geom_text() to label this player's point on the graph with his name. You may need to experiment with some of the options in geom_text(), like size, hjust, and vjust (hjust and vjust say where, in relation to the point location, to put the label), to get something you're happy with.

    worldcup %>%
      mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
      ggplot(aes(x = Time, y = Shots, color = top_4)) +
      geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
      geom_point(alpha = 0.5, size = 1) +
      geom_text(data = top_player, aes(label = Player, color = NULL),
                hjust = 1.2, vjust = 0.4) +
      labs(x = "Time played in World Cup (minutes)",
           color = "Team's final\nranking") +
      theme_few()

If you want to include both the player's name and his team, you can add a mutate() function when you create the new dataframe with just the top player, then use this for the label:

    top_player <- worldcup %>%
      top_n(n = 1, wt = Shots) %>%
      mutate(label = paste0(Player, ", ", Team))

    worldcup %>%
      mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
      ggplot(aes(x = Time, y = Shots, color = top_4)) +
      geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
      geom_point(alpha = 0.5, size = 1) +
      geom_text(data = top_player,
                aes(label = label, color = NULL),
                hjust = 1.1, vjust = 0.4) +
      labs(x = "Time played in World Cup (minutes)",
           color = "Team's final\nranking") +
      theme_few()
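To see what paste0() is doing when it builds the label column, here is a short sketch on made-up vectors (the player and team names are placeholders, not values from the data):

```r
# Hypothetical player and team names, for illustration only
player <- c("Player A", "Player B")
team <- c("Team X", "Team Y")

# paste0() joins its arguments with no separator, element by element
label <- paste0(player, ", ", team)
label

# paste() with a sep argument gives the same result here
identical(label, paste(player, team, sep = ", "))
```

Within mutate(), paste0() works the same way, building one label per row of the dataframe.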

To create small multiples, use the facet_wrap() command (you'll probably want to use ncol to specify 4 columns):

    top_player <- worldcup %>%
      top_n(n = 1, wt = Shots) %>%
      mutate(label = paste0(Player, ", ", Team))

    worldcup %>%
      mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany")) %>%
      ggplot(aes(x = Time, y = Shots, color = top_4)) +
      geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
      geom_point(alpha = 0.5, size = 1) +
      geom_text(data = top_player,
                aes(label = label, color = NULL),
                hjust = 1.1, vjust = 0.4) +
      labs(x = "Time played in World Cup (minutes)",
           color = "Team's final\nranking") +
      theme_few() +
      facet_wrap(~ Position, ncol = 4)

To re-order the Position column of the dataframe, add a mutate statement before you pipe into the plotting code. Use the levels option of the factor() function: whatever order you put the levels in for this argument will be the new order in which R saves the levels of this factor.

    worldcup %>%
      mutate(top_4 = Team %in% c("Netherlands", "Uruguay", "Spain", "Germany"),
             Position = factor(Position, levels = c("Goalkeeper", "Defender",
                                                    "Midfielder", "Forward"))) %>%
      ggplot() +
      geom_vline(xintercept = 90 * 3, color = "gray", linetype = 2) +
      geom_point(aes(x = Time, y = Shots, color = top_4),
                 alpha = 0.5, size = 1) +
      geom_text(data = top_player,
                aes(x = Time, y = Shots, label = label),
                hjust = 1.1, vjust = 0.4) +
      labs(x = "Time played in World Cup (minutes)",
           color = "Team's final\nranking") +
      theme_few() +
      facet_wrap(~ Position, ncol = 4)

Note from this next code example that you can use the levels function to find out the levels and their order for a factor-class vector.

    worldcup <- worldcup %>%
      mutate(Position = factor(Position,
                               levels = c("Goalkeeper", "Defender",
                                          "Midfielder", "Forward")))
    levels(worldcup$Position)

    ## [1] "Goalkeeper" "Defender"   "Midfielder" "Forward"
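The effect of the levels argument is easy to check on a short vector. This sketch uses a made-up character vector (pos) rather than the worldcup data; it compares the default level order with an explicit one:

```r
# Hypothetical vector of positions, in no particular order
pos <- c("Forward", "Goalkeeper", "Midfielder", "Defender", "Forward")

# Default: levels are sorted alphabetically
levels(factor(pos))

# Explicit levels: ordered from your own goal to the opponent's goal
pos_ordered <- factor(pos, levels = c("Goalkeeper", "Defender",
                                      "Midfielder", "Forward"))
levels(pos_ordered)

# The underlying integer codes follow the level order
as.integer(pos_ordered)
```

Facets are laid out in level order, so alphabetical default levels give alphabetically ordered facets until you set the levels yourself.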

Data visualization cheatsheet

RStudio comes with some excellent cheatsheets, which provide quick references to functions and code you might find useful for different tasks. For this part of the group exercise, you'll explore their cheatsheet for data visualization, both to learn some new ggplot2 code and to get familiar with how to use this cheatsheet as you do your own analysis.

  • Open the data visualization cheatsheet. You can do this from RStudio by going to "Help" -> "Cheatsheets" -> "Data Visualization with ggplot2."
  • Notice that different sections give examples with some datasets that come with either base R or ggplot2. For example, under the "Graphical Primitives" section, there is code defining the object a as a ggplot object using the "seals" dataset: a <- ggplot(seals, aes(x = long, y = lat)).
  • Go through the cheatsheet and list all of the example datasets that are used in this cheatsheet. Open their helpfiles to learn more about the data.
  • Create the example ggplot objects a through l and s through t using the code given on the cheatsheet.
  • Pick at least one example to try out from each of the following sections: "Graphical Primitives," "One Variable," at least three subsections of "Two Variables," "Three Variables," "Scales," "Faceting," and "Position Adjustments." As you try these, try to figure out any aesthetics that you aren't familiar with (e.g., ymin, ymax). Also, use helpfiles for the geoms to look up parameters you aren't familiar with (e.g., stat for geom_area). If you can't figure out how to interpret a plot, check the helpfile for the associated geom. Note: For the n object used in "Scales," it should be defined as n <- d + geom_bar(aes(fill = fl)).

Example R code

The code for opening the helpfiles for the example datasets is:

    ?seals
    ?economics
    ?mpg
    ?diamonds
    ?USArrests

Note that, for USArrests, only some of the columns are pulled out (e.g., murder = USArrests$Murder) to use in the example dataframe named data. Further, the "Visualizing error" examples use a dataframe created specifically for these examples, called df.

Some of the base R and ggplot2 example datasets have become fairly well-known. Some that you'll see very frequently in examples are the iris, mpg, and diamonds datasets.

All of the code to create the objects a through l and s through t is given somewhere on the cheatsheet. Here it is in full:

    a <- ggplot(seals, aes(x = long, y = lat))
    b <- ggplot(economics, aes(date, unemploy))
    c <- ggplot(mpg, aes(hwy))
    d <- ggplot(mpg, aes(fl))
    e <- ggplot(mpg, aes(cty, hwy))
    f <- ggplot(mpg, aes(class, hwy))
    g <- ggplot(diamonds, aes(cut, color))
    h <- ggplot(diamonds, aes(carat, price))
    i <- ggplot(economics, aes(date, unemploy))
    df <- data.frame(grp = c("A", "B"), fit = 4.5, se = 1:2)
    j <- ggplot(df, aes(grp, fit, ymin = fit - se, ymax = fit + se))
    data <- data.frame(murder = USArrests$Murder,
                       state = tolower(rownames(USArrests)))
    map <- map_data("state")
    k <- ggplot(data, aes(fill = murder))
    seals$z <- with(seals, sqrt(delta_long^2 + delta_lat^2))
    l <- ggplot(seals, aes(long, lat))
    s <- ggplot(mpg, aes(fl, fill = drv))
    t <- ggplot(mpg, aes(cty, hwy)) + geom_point()

Notice that, in some places, the aesthetics are defined using the full aesthetic name-value pair (e.g., aes(x = long, y = lat)), while in other places the code relies on position for defining which column of a dataframe maps to which aesthetic (e.g., aes(cty, hwy) or aes(fl)). Either is fine, although relying on position can result in errors if you are not very familiar with the order in which parameters are defined for a function.
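You can check how aes() interprets positional arguments by looking at the names of the mapping it returns. This is a small sketch (the object names positional, named, and with_fill are my own); it does not need any data, since aes() only records which column names go with which aesthetics:

```r
library(ggplot2)

# Positional arguments to aes() are matched to x first, then y
positional <- aes(cty, hwy)
named <- aes(x = cty, y = hwy)

names(positional)
names(named)

# Only x and y can be given by position; any other aesthetic
# (here, fill) has to be named
with_fill <- aes(fl, fill = drv)
names(with_fill)
```

Writing out x = and y = is a little longer, but it makes the mapping clear to someone reading the code later.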

This code will vary based on the examples you try, but here is some code for one set of examples:

    ## Add a geom to a base plot object to draw the plot
    b + geom_ribbon(aes(ymin = unemploy - 900, ymax = unemploy + 900))
    c + geom_dotplot()
    f + geom_violin(scale = "area")
    h + geom_hex()
    j + geom_pointrange()
    k + geom_map(aes(map_id = state), map = map) +
      expand_limits(x = map$long, y = map$lat)
    l + geom_contour(aes(z = z))
    ## Scales and facets can be layered on in the same way
    n <- d + geom_bar(aes(fill = fl))
    n + scale_fill_brewer(palette = "Blues")
    o <- c + geom_dotplot(aes(fill = ..x..))
    o + scale_fill_gradient(low = "red", high = "yellow")
    t + facet_grid(year ~ fl)
    s + geom_bar(position = "fill")
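One note on the dotplot example: the double-dot notation for a computed variable still runs, but more recent versions of ggplot2 (3.3.0 and later) provide after_stat() for the same purpose and flag the older form as deprecated. As a sketch, assuming you have a recent version of ggplot2 installed, the same plot could be written as:

```r
library(ggplot2)

## Same base object as `c` in the examples above
c <- ggplot(mpg, aes(hwy))

## after_stat(x) refers to the x value computed by the dotplot's binning stat
o <- c + geom_dotplot(aes(fill = after_stat(x)))
o + scale_fill_gradient(low = "red", high = "yellow")
```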