Principle of Analytic Graphics

Exploratory Graphs

One Dimension Summary of Data

  • summary(data) = returns min, 1st quartile, median, mean, 3rd quartile, max
  • boxplot(data$v1, col = "blue") = produces a box with the middle 50% of the data highlighted in the specified colour. Graphical output of summary().
  • hist(data$v1, col = "green") = produces a histogram with specified breaks and colour
    • breaks = 100 = suggested number of bars. The higher the number, the narrower the histogram columns. Too big = noisy / rough, too small = can't see shape of distribution
  • rug(data$v1) = adds a strip of tick marks under the histogram indicating the location of each data point
  • barplot(table(data), col = "wheat", main = "Title") = produces a bar graph, usually for categorical data
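The one-dimensional summaries above can be sketched on a built-in dataset. This is only an illustration: airquality$Ozone stands in for the notes' data$v1 (an assumption, since the notes do not say which data set is used), and breaks = 20 is an arbitrary choice.

```r
library(datasets)

# Assumption: Ozone readings stand in for data$v1 from the notes
ozone <- airquality$Ozone[!is.na(airquality$Ozone)]

five_num <- summary(ozone)            # min, 1st quartile, median, mean, 3rd quartile, max
boxplot(ozone, col = "blue")          # graphical version of summary()

h <- hist(ozone, col = "green", breaks = 20)  # breaks is treated as a suggestion
rug(ozone)                            # one tick mark per observation under the histogram

# Bar plot of a categorical variable: number of rows recorded in each month
month_counts <- table(airquality$Month)
barplot(month_counts, col = "wheat", main = "Observations per month")
```

Note that rug() draws onto the most recent plot, so it has to follow the hist() call it is meant to annotate.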

  • overlaying features
    • abline(h/v = 12) = overlays a horizontal (boxplot) or vertical (histogram) line at specified location.
    • col = "red" = specifies color
    • lwd = 4 = line width
    • lty = 2 = line type

    • Horizontal line useful over boxplot to see where a specified value falls relative to the quartiles.
    • example: What share of islands speak fewer than 12 languages?
      • boxplot(data); abline(h = 12). If the line is above the box, then more than 75% of islands (everything up to the 3rd quartile) speak fewer than 12 languages.
    • Vertical line useful over histogram
      • abline(v = median(data), col = "blue", lwd = 4) = displays median line. Note: boxplot contains median line as a feature, histogram does not
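As a rough sketch of the overlays described above, the following draws a reference line on a boxplot and a median line on a histogram. The data (airquality$Ozone) and the threshold of 80 are assumptions chosen for illustration only; they are not from the notes.

```r
library(datasets)

ozone <- airquality$Ozone[!is.na(airquality$Ozone)]
threshold <- 80   # hypothetical reference value, chosen to sit above the box

# Horizontal line over a boxplot: where does the threshold fall relative to the box?
boxplot(ozone, col = "blue")
abline(h = threshold, col = "red", lwd = 2, lty = 2)

# The line is above the box, so more than 75% of values lie at or below it
share_below <- mean(ozone <= threshold)

# Vertical line over a histogram: mark the median, which hist() does not draw itself
hist(ozone, col = "green")
abline(v = median(ozone), col = "blue", lwd = 4)
```

Computing share_below alongside the plot gives the exact proportion that the boxplot only lets you bound by eye.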

Two Dimensional Summaries

  • multiple/overlaid 1-D plots (lattice/ggplot2)
  • box plots: boxplot(pm25 ~ region, data = pollution, col = "red")
  • air pollution is higher on average in the east than in the west. The west is lower but more spread out, with some high extreme values

  • histogram:
    • par(mfrow = c(2, 1), mar = c(4, 4, 2, 1)) = stacks two plots in one column (2 rows, 1 column) and sets the margins
    • hist(subset(pollution, region == "east")$pm25, col = "green") = first histogram
    • hist(subset(pollution, region == "west")$pm25, col = "green") = second histogram
  • air pollution is higher in the east than in the west, but the east has no extreme values. The west is lower but more spread out, with some high extreme values
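The pollution data frame used above is not a built-in dataset, so the sketch below applies the same two techniques to airquality instead, with Month playing the role of region (an assumption made purely so the example runs as-is).

```r
library(datasets)

# Side-by-side box plots: one box per level of the grouping variable
boxplot(Ozone ~ Month, data = airquality, col = "red")

# Stacked histograms: 2 rows, 1 column, with a shared x-axis range so the
# two distributions can be compared directly
old_par <- par(mfrow = c(2, 1), mar = c(4, 4, 2, 1))
ozone_range <- range(airquality$Ozone, na.rm = TRUE)
hist(subset(airquality, Month == 5)$Ozone, col = "green",
     xlim = ozone_range, main = "May", xlab = "Ozone")
hist(subset(airquality, Month == 8)$Ozone, col = "green",
     xlim = ozone_range, main = "August", xlab = "Ozone")
par(old_par)   # restore the previous layout and margins
```

Saving the value returned by par() and restoring it afterwards keeps the two-panel layout from leaking into later plots.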

  • scatterplot
  • with(pollution, plot(latitude, pm25)) = scatterplot of pollution going south to north (latitude). Plots two variables.
  • abline(h = 12, lwd = 2, lty = 2) = plots horizontal dashed line
  • with(pollution, plot(latitude, pm25, col = region)) = same as first scatterplot, but data points are coloured by region. Red = west, Black = east. Plots three variables.

  • multiple scatterplots
  • another way of looking at three variables instead of using colour
  • par(mfrow = c(1, 2), mar = c(5, 4, 2, 1)) = places two plots side by side (1 row, 2 columns) and sets the margins
    • with(subset(pollution, region == "west"), plot(latitude, pm25, main = "West")) = left scatterplot
    • with(subset(pollution, region == "east"), plot(latitude, pm25, main = "East")) = right scatterplot

  • both scatterplots (single & multiple) show that pollution is higher in mid-latitudes than at low / high latitudes for both eastern and western regions.
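Both ways of showing a third variable, colour and separate panels, can be tried on a built-in dataset. The sketch below uses iris, with Species standing in for region (an assumption; the notes use the pollution data, which is not shipped with R).

```r
library(datasets)

# Colour by a factor: R maps factor levels to palette colours 1, 2, 3, ...
with(iris, plot(Sepal.Length, Petal.Length, col = Species))
legend("topleft", legend = levels(iris$Species),
       col = seq_along(levels(iris$Species)), pch = 1)

# Same relationship, one panel per group instead of one colour per group
old_par <- par(mfrow = c(1, 2), mar = c(5, 4, 2, 1))
with(subset(iris, Species == "setosa"),
     plot(Sepal.Length, Petal.Length, main = "setosa"))
with(subset(iris, Species == "virginica"),
     plot(Sepal.Length, Petal.Length, main = "virginica"))
par(old_par)
```

Because each panel picks its own axis limits by default, the panelled version can hide differences in scale; passing common xlim/ylim values is one way to keep the panels comparable.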

  • scatterplot
  • with(pollution, plot(latitude, pm25, col = region))
    • abline(h = 12, lwd = 2, lty = 2) = plots horizontal dashed line showing the standard air quality level
    • plot(jitter(child, 4) ~ parent, galton) = adds small random noise so that data points at the same position spread out, simulating measurement error and making high-frequency values more visible
  • summary
  • quick and dirty
  • Explore basic questions and hypotheses (perhaps rule them out)
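The jitter() call mentioned above relies on the galton data, which comes from an add-on package. As a stand-in, this sketch jitters the discrete cyl variable in the built-in mtcars data (an assumption for illustration), so that cars with the same cylinder count no longer sit on a single vertical line.

```r
library(datasets)

set.seed(1)   # jitter() is random; fixing the seed makes the plot reproducible

# cyl only takes the values 4, 6 and 8, so many points overlap without jitter
jittered_cyl <- jitter(mtcars$cyl, factor = 1)

plot(jittered_cyl, mtcars$mpg,
     xlab = "Cylinders (jittered)", ylab = "Miles per gallon")
```

The noise is for display only; any summaries or models should still use the original, unjittered values.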

Process of Making a Plot/Considerations

  • where will plot be made? screen or file?
  • how will plot be used? viewing on screen/web browser/print/presentation?
  • large amount of data vs few points?
  • need to be able to dynamically resize?
  • plotting system: base, lattice, ggplot2?
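For the screen-versus-file question, a minimal sketch of sending a plot to a file device rather than the screen is shown here. The output path is hypothetical (a temporary directory is used so that nothing is written into the working directory), and pdf() is just one of several file devices.

```r
library(datasets)

# Hypothetical output location; replace with a real path as needed
out_file <- file.path(tempdir(), "ozone_hist.pdf")

pdf(out_file, width = 6, height = 6)   # open a file device (size in inches)
hist(airquality$Ozone, col = "green", main = "Ozone")
dev.off()                              # close the device so the file is written
```

A vector format such as PDF resizes cleanly, which matters for print; for plots with very many points, a bitmap device may produce smaller files.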

Base Plotting

Base Graphics Functions and Parameters

  • arguments
    • pch: plotting symbol (default = open circle)
    • lty: line type (default is solid)
      • 0=blank, 1=solid (default), 2=dashed, 3=dotted, 4=dotdash, 5=longdash, 6=twodash
    • lwd: line width (positive number, default = 1)
    • col: plotting color (number, string, or hex code; colors() returns vector of color names)
    • xlab, ylab: x-y label character strings
    • cex: numerical value giving the amount by which plotting text/symbols should be magnified relative to the default
      • cex = 0.15 * variable: plot size as an additional variable
  • par() function = specifies global graphics parameters, affects all plots in an R session (can be overridden)
    • las: orientation of axis labels
    • bg: background color
    • mar: margin size (order = bottom left top right)
    • oma: outer margin size (default = 0 for all sides)
    • mfrow: c(rows, columns) of the plot grid (plots are filled row-wise)
    • mfcol: c(rows, columns) of the plot grid (plots are filled column-wise)
    • can verify all above parameters by calling par("parameter")
  • plotting functions
    • plot = make a scatterplot, or other plot depending on class of data
    • lines = add lines to existing plot, e.g. connecting dots in a time-series plot
    • points = add points to a plot, e.g. adding a different group or subset afterwards
    • text = add text labels within the plot using specified x,y coordinates
    • title = add text labels outside the plot (x/y-axis, title, subtitle, outer margin)
    • mtext = add text to inner/outer margin of plot
    • axis = add axis ticks/labels
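To see how the parameters and annotation functions listed above fit together, here is a small sketch that builds a plot layer by layer. The dataset, the colours, and the reference value of 100 are arbitrary choices for illustration.

```r
library(datasets)

# Global settings for the current device; par() returns the old values
old_par <- par(mar = c(4, 4, 3, 1), las = 1)

# Empty frame first (type = "n"), then add each element explicitly
with(airquality, plot(Wind, Ozone, type = "n", axes = FALSE, xlab = "", ylab = ""))
with(airquality, points(Wind, Ozone, pch = 20, col = "steelblue", cex = 0.8))
abline(h = 100, lty = 3, lwd = 2)                 # dotted line at an arbitrary value
text(15, 110, "arbitrary reference", cex = 0.8)   # label inside the plot region
axis(1); axis(2); box()                           # axis ticks/labels and frame
title(main = "Ozone vs Wind", xlab = "Wind", ylab = "Ozone")
mtext("airquality dataset", side = 3, line = 0.3, cex = 0.8)   # text in the top margin

current_mar <- par("mar")   # query a single parameter
par(old_par)                # restore previous settings
```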

Base Plot Example

library(datasets)
# type = "n" sets up the plot and does not fill it with data
with(airquality, plot(Wind, Ozone, main = "Ozone and Wind in New York City", type = "n"))
# subsets of data are plotted here using different colors
with(subset(airquality, Month == 5), points(Wind, Ozone, col = "blue"))
with(subset(airquality, Month != 5), points(Wind, Ozone, col = "red"))
legend("topright", pch = 1, col = c("blue", "red"), legend = c("May", "Other Months"))
# regression line is produced here
model <- lm(Ozone ~ Wind, airquality)
abline(model, lwd = 2)

lattice Plotting System

lattice Functions and Parameters

  • Functions
    • xyplot() = main function for creating scatterplots
    • bwplot() = box and whiskers plots (box plots)
    • histogram() = histograms
    • stripplot() = like a box plot, but with actual points
    • dotplot() = plot dots on "violin strings"
    • splom() = scatterplot matrix (like pairs() in base plotting system)
    • levelplot()/contourplot() = plotting image data
  • Arguments for xyplot(y ~ x | f * g, data, layout, panel)
    • default blue open circles for data points
    • formula notation is used here (~) = left hand side is the y-axis variable, and the right hand side is the x-axis variable
    • f/g = conditioning/categorical variables (optional)
      • basically creates multi-panelled plots (for different factor levels)
      • * indicates interaction between two variables
      • intuitively, the xyplot displays a graph between x and y for every level of f and g
    • data = the data frame/list from which the variables should be looked up
      • if nothing is passed, the parent frame is used (searching for variables in the workspace)
      • if no other arguments are passed, defaults will be used
    • layout = specifies how the different plots will appear
      • layout = c(5, 1) = produces 5 subplots in a horizontal fashion
      • padding/spacing/margin automatically set
    • [optional] panel function can be added to control what is plotted inside each panel of the plot
      • panel functions receive x/y coordinates of the data points in their panel (along with any additional arguments)
      • ?panel.xyplot = brings up documentation for the panel functions
      • Note: base plotting functions (e.g. points(), lines(), text()) cannot be used to add to lattice plots; use panel functions instead
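The xyplot() arguments described above can be put together in a short sketch. It assumes the lattice package is installed (it normally ships with R) and uses airquality with Month as the conditioning variable; the per-panel median line is an illustrative choice, not something from the notes.

```r
library(lattice)
library(datasets)

# Conditioning variables should be factors so each level gets its own panel
aq <- transform(airquality, Month = factor(Month))

p <- xyplot(Ozone ~ Wind | Month, data = aq,
            layout = c(5, 1),                    # 5 columns, 1 row
            panel = function(x, y, ...) {
              panel.xyplot(x, y, ...)            # draw the default points
              # dashed line at the median of the y values in this panel only
              panel.abline(h = median(y, na.rm = TRUE), lty = 2)
            })

print(p)   # lattice functions return a "trellis" object; printing draws it
```

The panel function is called once per panel with only that panel's x and y values, which is why the median line differs from month to month.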

ggplot2 system