Reproducible research: concepts and ideas
Replication
- the ultimate standard for strengthening scientific evidence
- reproduce findings and conduct studies with independent investigators, data, laboratories, and analytical methods
- especially important in studies that affect policy
- HOWEVER, some studies can't be replicated due to time, money, or their unique/opportunistic nature
- the middle ground between replication (gold) and no replication (bronze) is reproducibility (silver)
Reproducibility
- bridges the gap between replication and leaving the study as is (no replication); reproducibility is a minimum standard
- take the data/code and reproduce the findings \(\rightarrow\) validate the original findings
- Why do we need reproducible research
- datasets and studies are getting larger and more expensive, so not feasible to replicate
- Example: air pollution studies. Data is publicly available, and the code for the analytical methods used is provided.
- Looking for small health effects
- Important for informing policy
- Complex statistical methods are used
- by making the analysis reproducible, the same findings can be generated quickly and by anyone
- Research Pipeline
- measured data \(\rightarrow\) analytic data \(\rightarrow\) computational results \(\rightarrow\) figures/tables/numerical summaries \(\rightarrow\) article
- Necessities for Reproducible Research
- Analytic data: data that was used for the analysis. Not raw data, which may include data not used in analysis
- Analytic code
- documentation of code and data
- Standard means of distribution
- Challenges
- authors must make considerable effort to make data/results available
- readers must download data/results individually and piece it all together
- readers may not have the same resources as authors (computing clusters, etc.)
- few tools for readers/authors
- In reality, authors just put their content online (often disorganized, but there are some central databases for various fields) and readers have to find the data and piece it together
Literate/Statistical Programming
- an article is viewed as a stream of text and code
- original idea came from Don Knuth
- statistical analysis is divided into text and code chunks
- text explains what's going on
- code loads data/computes results
- presentation code formats results (tables, graphics, etc.)
- Literate Programming requires
- documentation language - [weave] produces human-readable documents (HTML/PDF)
- programming language - [tangle] produces machine-readable code
- Sweave - original program in R designed to do literate programming
- LaTeX for documentation, R for programming
- developed by Friedrich Leisch (member of R core)
- still maintained by R core
- limitations
- focused on LaTeX, a markup language that is difficult to learn
- lacks features like caching, multiple plots per chunk, and mixing of programming languages, among other technical limitations
- not frequently updated/actively developed
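- for reference, a minimal sketch of a Sweave source file (.Rnw); the chunk name and contents here are illustrative:
\documentclass{article}
\begin{document}
Some \LaTeX{} text explaining the analysis.
<<summaryChunk, echo=TRUE>>=
summary(cars)
@
\end{document}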
knitr
= alternative literate programming package
- R for programming (can use others as well)
- LaTeX/HTML/Markdown for documentation
- developed by Yihui Xie (then a grad student at Iowa State)
- reproducible research = minimum standard for non-replicable/computationally intensive studies
- infrastructure is still needed for creating/distributing reproducible documents
Structure of Data Analysis
Key Challenge in Data Analysis
“Ask yourselves, what problem have you solved, ever, that was worth solving, where you knew all of the given information in advance? Where you didn’t have a surplus of information and have to filter it out, or you had insufficient information and have to go find some?” - Dan Meyer
Define general question
- most powerful dimension reduction tool; narrowing down the question reduces noise and simplifies the problem
- narrowing down the question removes extraneous variables (e.g. height, location, sample), reducing the dimensionality of the problem
- Start with general question
- Can I automatically detect areas of plankton growth?
- Make it concrete: translate question using data analysis terminology
- Can I use quantitative characteristics of SST (sea surface temperature) images to classify areas around the coast of the UK as areas of high plankton growth?
Define the Ideal Dataset
* **Descriptive** - a whole population (e.g. all Pokemon)
* **Exploratory** - a random sample with many variables measured
* e.g. choose random Pokemon; measure height, speed, attack
* **Inferential** - the right population, randomly sampled (drawing conclusions from a sample about a larger population)
* e.g. water-type Pokemon are quicker than fire-type in the sea
* **Predictive** - a training set and a test set from the same population, used to build a model and classifier
* e.g. Pokemon raised by boys are stronger
* **Causal** - experimental data from a randomized study (if one thing is modified, it causes another to change)
* **Mechanistic** - data from all components of the system that you want to describe
Determine what data you can access
- free data on the web.
- sometimes, if there's no data, you will need to generate it yourself
Obtain The Data
- try to obtain the raw data and reference the source
- if loading data from an internet source, record (at minimum) the URL and time of access
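- example (a minimal sketch; the URL and file name are hypothetical):
fileUrl <- "https://example.com/raw_data.csv"  # hypothetical source
download.file(fileUrl, destfile = "raw_data.csv")
dateDownloaded <- date()  # record the time of access alongside the URL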
Clean The Data
- raw data needs to be processed before it can be fed into a modeling program
- if the data is already processed, you need to understand how, and read the documentation on how the pre-processing was done
- understand the source of the data (e.g. for a survey: how was the survey done?)
- you may need to reformat/subsample the data \(\rightarrow\) record these steps
- determine if the data is good enough to solve the problem \(\rightarrow\) if not, find other data, quit, or change the question (to avoid misleading conclusions)
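- example (a hedged sketch using the built-in airquality dataset as a stand-in for raw data):
raw <- airquality                           # stand-in for raw survey/sensor data
cleaned <- subset(raw, !is.na(Ozone))       # subsample: drop rows missing Ozone
cleaned$Temp <- (cleaned$Temp - 32) * 5/9   # reformat: Fahrenheit to Celsius
# every step is recorded in code, so the processing can be re-run and audited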
Exploratory Data Analysis
- subsampling the dataset
- for this example (classifying spam email), the data needs to be split into a training set and a test set (a predictive problem)
- model = is an email spam or not spam?
- training set = the dataset used to build the model
- test set = the dataset used to determine how well the model makes predictions
- example
# If it isn't installed, install the kernlab package with install.packages()
library(kernlab)
data(spam)
# Perform subsampling (coin is flipped to split data into 2 pieces)
set.seed(3435)
trainIndicator = rbinom(4601, size = 1, prob = 0.5)
# Roughly 2314 in one half and 2287 in other half
table(trainIndicator)
## trainIndicator
## 0 1
## 2314 2287
trainSpam = spam[trainIndicator == 1, ]
testSpam = spam[trainIndicator == 0, ]
- look at one/two-dimensional summaries of the data: what the data looks like, what the distributions look like, relationships between variables, etc.
- names(), summary(), head() = variable names (word/character frequency counts), summaries, and the first few rows
- table(trainSpam$type) = counts of email classifications (spam vs. nonspam)
- Check for missing data
- create exploratory plots for a rough sense of what the data looks like
- plot(trainSpam$capitalAve ~ trainSpam$type) = average number of capital letters by email type
- the plot is difficult to read because the data is highly skewed
- if data is skewed and difficult to look at, it is useful to look at a log transformation of the variable
- plot(log10(trainSpam$capitalAve + 1) ~ trainSpam$type)
- because the data is skewed \(\rightarrow\) use log
- because there are a lot of zeroes \(\rightarrow\) add +1, since taking the log of zero is undefined
- perform exploratory analysis (e.g. clustering)
- relationships between predictors: plot(log10(trainSpam[, 1:4] + 1)) = pairwise plots of log-transformed variables; some characteristics are related, some are not

- hierarchical cluster analysis (Dendrogram) = what words/characteristics tend to cluster together
hCluster = hclust(dist(t(trainSpam[, 1:57])))
- Note: dendrograms can be sensitive to skewness in the distribution of individual variables, so it may be useful to transform the predictors (see below)
hClusterUpdated = hclust(dist(t(log10(trainSpam[, 1:55] + 1))))

Statistical prediction / modeling
- once you have done exploratory analysis (univariate statistics/summaries, bivariate statistics/plots, and clustering), you can move on to more sophisticated statistical prediction and modeling
- modeling should be informed by the results of your exploratory analysis
- exact methods depend on the question of interest
- processing/transformations should be accounted for where necessary
- go through each variable in the dataset and fit a logistic regression to see if a single variable can predict whether an email is spam
trainSpam$numType = as.numeric(trainSpam$type) - 1
costFunction = function(x, y) sum(x != (y > 0.5))
cvError = rep(NA, 55)
library(boot)
for (i in 1:55) {
# creates formula with one variable and the result
lmFormula = reformulate(names(trainSpam)[i], response = "numType")
glmFit = glm(lmFormula, family = "binomial", data = trainSpam)
# cross validated error
cvError[i] = cv.glm(trainSpam, glmFit, costFunction, 2)$delta[2]
}
# Which predictor has minimum cross-validated error?
names(trainSpam)[which.min(cvError)]
## [1] "charDollar"
- think about measures/sources of uncertainty
# Use the best model from the group
predictionModel = glm(numType ~ charDollar,family="binomial",data=trainSpam)
# Get predictions on the test set
predictionTest = predict(predictionModel,testSpam)
predictedSpam = rep("nonspam",dim(testSpam)[1])
# Classify as 'spam' for those with prob > 0.5
predictedSpam[predictionModel$fitted > 0.5] = "spam"
# Classification table
table(predictedSpam, testSpam$type)
##
## predictedSpam nonspam spam
## nonspam 1346 458
## spam 61 449
# Error rate
(61 + 458)/(1346 + 458 + 61 + 449)
## [1] 0.2242869
- fit a logistic regression model for each variable, predicting email type (spam vs. nonspam)
- calculate the cross-validated (CV) error for each variable
- choose the variable with the lowest CV error: this is the single variable best at predicting spam
- the variable with the lowest error is charDollar (the frequency of the $ character)
- build a model using charDollar to predict email type (spam vs. nonspam)
- use the model to make predictions on the test dataset, i.e. predict outcomes on the test set to see how well we do
- calculate the error rate of the model using the classification table
Challenge results
- Challenge all steps
- Question
- Data source
- Processing
- Analysis
- Conclusions
- challenge measures of uncertainty: are these appropriate measures of uncertainty?
- challenge choice of models: why is yours the suitable model for the problem? why did you choose the components of your model?
- Think of potential alternative analysis. How could others expand on your study? Are there alternatives that are useful in some way or produce better predictions?
Synthesize/write-up results
Coding standards in R
- write code in text editor and save as text file
- indent consistently (4 spaces minimum, 8 preferable)
- limit the width of your code (80 columns)
- limit the length of functions, splitting long code into separate functions
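- a short sketch illustrating these standards (the function and task are arbitrary):
# 4-space indents, lines under 80 columns, one focused task per function
columnMeans <- function(df) {
    means <- numeric(ncol(df))
    for (i in seq_len(ncol(df))) {
        means[i] <- mean(df[, i], na.rm = TRUE)
    }
    means
}
columnMeans(airquality)  # example call on a built-in dataset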
Markdown
- a text-to-HTML conversion tool for web writers
- a simplified markup language
- write using an easy-to-read, easy-to-write plain text format
- any text editor can create a markdown document
- the text is converted to structurally valid XHTML/HTML
- created by John Gruber
- italics = *text*
- bold = **text**
- main heading = # Heading
- secondary heading = ## Heading
- tertiary heading = ### Heading
- unordered list = - first element
- ordered list = 1. first element
- the number in front doesn't actually matter; as long as there is a number there, markdown compiles it to an ordered list (no need for renumbering)
- links = [text](url) OR [text][1] \(\rightarrow\) later in the document, define all of the links in this format: [1]: url "text"
- new lines = require a double space after the end of a line
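- putting it together, a small illustrative Markdown snippet (the content is made up):
*Italic* and **bold** text.
# Main Heading
## Secondary Heading
- an unordered item
1. an ordered item
A [link](https://example.com), and a line ending in two spaces for a line break.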
R Markdown
- integration of R code and markdown
- allows creating documents containing “live” R code
- R code is evaluated as a part of the processing of the markdown
- results from R code inserted into markdown document
- core tool for literate statistical programming
- pros
- text/code all in one place in logical order
- data/results automatically updated to reflect external changes
- code is live = automatic test when building document
- cons
- text/code all in one place can be difficult to read, especially if there's a lot of code
- can substantially slow down the processing of documents
knitr package
- written by Yihui Xie; built into RStudio
- support Markdown, LaTeX, HTML as documentation languages
- Exports PDF/HTML
- good for manuals, short/medium technical documents, tutorials, periodic reports, data preprocessing documents/summaries
- not good for long research articles, complex time-consuming computations, precisely formatted documents
- evaluates the R code in R Markdown documents, records the results, and writes out a Markdown file
- the Markdown file can then be converted into HTML using the markdown package
- the slidify package converts R Markdown into presentation slides
- In RStudio, create new R Markdown files by clicking New \(\rightarrow\) R Markdown
- ======== (under a line of text) = indicates the title of the document (large text)
- $expression$ = indicates a LaTeX expression/formatting
- `text` = changes text to code format (typewriter font)
- ``` {r name, echo = FALSE, results = "hide"} ... ``` = R code chunk
- name = name of the code chunk
- echo = FALSE = turns off echoing of the R code, displaying only the result
- Note: by default, code in a code chunk is echoed = both the code AND the results are printed
- results = "hide" = hides the results from the markdown document
- inline text computations
- `r variable` = prints the value of that variable directly inline with the text
- incorporating graphics
- ``` {r scatterplot, fig.height = 4, fig.width = 6} ... plot() ... ``` = inserts a plot into the markdown document
- scatterplot = name of this code chunk (can be anything)
- fig.height = 4 = adjusts the height of the figure (specifying a height/width produces a rectangular plot rather than the default square one)
- fig.width = 6 = adjusts the width of the figure
- knitr produces HTML with the image embedded using base64 encoding
- does not depend on external image files
- not space-efficient but highly portable
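- a minimal sketch of a complete .Rmd file combining these pieces (chunk names and values are illustrative):

My Report
=========

``` {r simulate, echo = FALSE}
x <- rnorm(100)
```

The simulated mean was `r round(mean(x), 2)`.

``` {r scatterplot, fig.height = 4, fig.width = 6}
plot(x, main = "Simulated data")
```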
- incorporating tables (xtable package: install.packages("xtable"))
- xtable prints the table in HTML format, which presents better than plain text
library(datasets)
library(xtable)
fit <- lm(Ozone ~ Wind + Temp + Solar.R, data = airquality)
xt <- xtable(summary(fit))
print(xt, "html")
|             | Estimate | Std. Error | t value | Pr(>\|t\|) |
|-------------|----------|------------|---------|------------|
| (Intercept) | -64.3421 | 23.0547    | -2.79   | 0.0062     |
| Wind        | -3.3336  | 0.6544     | -5.09   | 0.0000     |
| Temp        | 1.6521   | 0.2535     | 6.52    | 0.0000     |
| Solar.R     | 0.0598   | 0.0232     | 2.58    | 0.0112     |
- setting global options
- ``` {r setoptions, echo = FALSE} opts_chunk$set(echo = FALSE, results = "hide") ``` = sets the default so that code/results are not printed unless otherwise specified
- common options
- output: results = "asis" OR results = "hide"
- "asis" = output stays in its original format and is not compiled into HTML
- output: echo = TRUE OR echo = FALSE
- figures: fig.height = numeric
- figures: fig.width = numeric
- caching computations
- add an argument to the code chunk: cache = TRUE
- computes and stores result of code the first time it is run, and calls the stored result directly from file for each subsequent call
- useful for complex computations
- caveats:
- if data/code changes, you will need to re-run cached code chunks
- dependencies not checked explicitly (changes in other parts of the code \(\rightarrow\) need to re-run the cached code)
- if the code does something outside of the document (e.g. produces a PNG file), that side effect cannot be cached
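- example chunk (a sketch; the computation is just a placeholder):

``` {r longComputation, cache = TRUE}
# expensive step: computed on the first knit, loaded from the cache afterwards
result <- mean(rnorm(1e6))
```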
- “Knit HTML” button to process the document
- alternatively, when not in RStudio, the process can be accomplished through the following
library(knitr)
setwd(<working directory>)
knit2html("document.Rmd")
browseURL("document.html")
- processing of knitr documents
- author drafts R Markdown (.Rmd) \(\rightarrow\) knitr processes the file to Markdown (.md) \(\rightarrow\) knitr converts the file to HTML
- Note: the author should NOT edit/save the .md or .html documents; edit only the .Rmd until done
Communicating Results
- when presenting information, it is important to break down the results of an analysis into different levels of granularity/detail
- present from least \(\rightarrow\) most specific (e.g. title/abstract \(\rightarrow\) body/results \(\rightarrow\) supplementary materials \(\rightarrow\) code/data)
RPubs
- a built-in service with RStudio that works seamlessly with knitr
- after knitting the R Markdown file to HTML, you can click Publish to post the HTML document on RPubs
- Note: all publications to RPubs are publicly available immediately
Reproducible Research Checklist
- Checklist
- Are we doing good science?
- Was any part of this analysis done by hand?
- if so, are those parts precisely documented?
- Does the documentation match reality?
- Have we taught a computer to do as much as possible (i.e. coded)?
- Are we using a version control system?
- Have we documented our software environment?
- Have we saved any output that we cannot reconstruct from original data + code?
- How far back in the analysis pipeline can we go before our results are no longer (automatically) reproducible?
Do
- start with good science
- coherent/focused question simplifies many problems
- choose something interesting that will motivate you
- collaborate with others to reinforce good practices and habits
- teach a computer
- worthwhile to automate all tasks through script/programming
- code = precise instructions to process/analyze data
- teaching the computer almost guarantees reproducibility
- download.file("url", "filename") = convenient way to download files
- the full URL is specified (instead of a series of links/clicks)
- the name of the file is specified
- the directory is specified
- the code can be executed in R (as long as the link is available)
- use version control
- GitHub/BitBucket are good tools
- helps to slow down
- forces the author to think about the changes being made, commit them, and keep track of the analyses performed
- helps to keep track of history/snapshots
- allows reverting to old versions
- keep track of software environment
- some tools/datasets may only work on certain software/environment
- software and computing environment are critical to reproducing analysis
- everything should be documented
- computer architecture: CPU (Intel, AMD, ARM), GPUs, 32 vs 64bit
- operating system: Windows, Mac OS, Linux/Unix
- software toolchain: compilers, interpreters, command shell, programming languages (C, Perl, Python, etc.), database backends, data analysis software
- supporting software/infrastructure: Libraries, R packages, dependencies
- external dependencies: web sites, data repositories (data source), remote databases, software repositories
- version numbers: ideally, for everything (if available)
- sessionInfo() = prints the R version, operating system, locale, and base/attached/loaded packages
- set random number generator seed
- random number generators produce pseudo-random numbers based on initial seed (usually number/set of numbers)
- set.seed() can be used to specify the seed for the random number generator in R
- setting the seed allows the stream of random numbers to be reproducible
- whenever you need to generate a stream of random numbers for a non-trivial purpose (e.g. simulations, Markov chain Monte Carlo analysis), always set the seed
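- example: the same seed reproduces the same stream
set.seed(42)
x <- rnorm(5)
set.seed(42)
y <- rnorm(5)
identical(x, y)  # TRUE: both streams match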
- think about entire pipeline
- data analysis is lengthy process, important to ensure each piece is reproducible
- final product is important, but the process is just as important
- raw data \(\rightarrow\) processed data \(\rightarrow\) analysis \(\rightarrow\) report
- the more of the pipeline that is made reproducible, the more credible the results are
Don’ts
- do things by hand
- may lead to unreproducible results
- editing spreadsheets in Excel
- removing outliers (without noting the criteria)
- editing tables/figures
- validating/quality-controlling data by hand
- downloading data from a website (clicking on a link)
- a lengthy set of instructions would be needed to obtain the same dataset
- moving/splitting/reformatting data (with no record of what was done)
- if necessary, manual tasks must be documented precisely (account for people with different background/context)
- point and click
- graphical user interfaces (GUI) make it easy to process/analyze data
- GUIs are intuitive to use but actions are difficult to track and for others to reproduce
- some GUIs include log files that can be utilized for review
- any interactive software should be carefully used to ensure all results can be reproduced
- text editors are usually ok, but when in doubt, document it
- save output for convenience
- avoid saving data analysis output (tables, summaries, figures, processed data) as stand-alone artifacts
- output saved stand-alone, without the code/steps that produced it, is not reproducible
- when the data changes or an error is detected upstream, the saved graph/table will not be updated
- intermediate files (processed data) are OK to keep, but clear/precise documentation of how they were created must exist
- save the data/code instead of the output
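- example (a sketch; the file name is illustrative):
# save the processed/analytic data and keep the script, rather than
# saving a finished table or figure by itself
analytic <- subset(airquality, !is.na(Ozone))
saveRDS(analytic, "analytic_data.rds")
# any table/figure can be regenerated later from this file plus the code
analytic <- readRDS("analytic_data.rds")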
Replication vs Reproducibility
- Replication
- focuses on validity of scientific claim
- if a study claims "X is related to Y", then we ask "is the claim true?"
- need to reproduce results with new investigators, data, analytical methods, laboratories, instruments, etc.
- particularly important in studies that can impact broad policy or regulatory decisions
- stands as ultimate standard for strengthening scientific evidence
- Reproducibility
- focuses on validity of data analysis
- “Can we trust this analysis?”
- reproduce results with new investigators, using the same data and same methods
- important when replication is impossible
- arguably a minimum standard for any scientific study
Background and Underlying Trend for Reproducibility
- some studies cannot be replicated (money/time/unique/opportunistic) \(\rightarrow\) 25 year studies in epidemiology
- technology increases rate of collection/complexity of data
- existing databases merged to become bigger databases (sometimes used off-label) \(\rightarrow\) administrative data used in health studies
- computing power allows for more sophisticated analyses
- “computational” fields are ubiquitous
- all of the above leads to the following
- analyses are difficult to describe
- inadequate training in statistics/computing causes errors to be produced more easily throughout the pipeline
- large data throughput, complex analysis, and failure to explain the process can inhibit knowledge transfer
- results are difficult to replicate (knowledge/cost)
- complicated analyses are not as easily trusted
Problems with Reproducibility
- aiming for reproducibility provides for
- transparency of data analysis
- check validity of analysis \(\rightarrow\) people check each other \(\rightarrow\) “self-correcting” community
- re-run the analysis
- check the code for bugs/errors/sensitivity
- try alternate approaches
- data/software/methods availability
- improved transfer of knowledge
- focuses on the "downstream" aspect of research dissemination (does not solve problems with erroneous analysis)
- reproducibility does not guarantee
- correctness of analysis (reproducible \(\neq\) true)
- trustworthiness of analysis
- deterrence of bad analysis
Evidence-based Data Analysis
- create analytic pipelines from evidence-based components \(\rightarrow\) standardize them (a "Deterministic Statistical Machine")
- apply thoroughly studied (via statistical research), mutually agreed-upon methods to analyze data whenever possible
- once the pipeline is established, it shouldn't change (a "transparent box")
- this reduces the “researcher degrees of freedom” (restricting the ability to alter different parts of analysis pipeline)
- analogous to a pre-specified clinical trial protocol
- example: time series of air pollution/health
- key question: “Are short-term changes in pollution associated with short-term changes in a population health outcome?”
- long history of statistical research investigating proper methods of analysis
- pipeline
- check for outliers, high leverage, overdispersion \(\rightarrow\) skewness of data
- Fill in missing data? \(\rightarrow\) NO! (systematic missing data, imputing would add noise)
- model selection: estimate degrees of freedom to adjust for unmeasured confounding variables
- multiple lag analysis
- sensitivity analysis with respect to unmeasured confounder adjustment/influential points
- this provides standardized, best practices for given scientific areas and questions
- gives reviewers an important tool without dramatically increasing the burden on them
- allows more effort to be focused on improving the quality of “upstream” aspects of scientific research
Caching Computations
- Literate (Statistical) Programming
- paper/code chunks \(\rightarrow\) database/cached computations \(\rightarrow\) published report

cacher Package
- evaluates code in files and stores intermediate results in a key-value database
- R expressions are given SHA-1 hash values so changes can be tracked and code can be re-evaluated if necessary
- cacher packages
- used by authors to create data analyses packages for distribution
- used by readers to clone analysis and inspect/evaluate subset of code/data objects \(\rightarrow\) readers may not have resources/want to see entire analysis
- conceptual model
- dataset/code \(\rightarrow\) source file \(\rightarrow\) result
- process for authors
- the cacher package parses R source files and creates the necessary cache directories/subdirectories
- if an expression has never been evaluated
- evaluate it and store any resulting R objects in the cache database (similar to cache = TRUE in knitr)
- else if a cached result exists
- lazy-load the results from the cache database and move on to the next expression
- if an expression does not create any R objects (nothing to cache)
- add the expression to a list where evaluation needs to be forced
- write out metadata for this expression to the metadata file
- the cachepackage() function creates a cacher package storing
- source file
- cached data objects
- metadata
- package file is zipped and can be distributed
- process for readers
- a journal article can say that “the code and data for this analysis can be found in the cacher package 092dcc7dda4b93e42f23e038a60e1d44dbec7b3f”
- the code is identified by its SHA-1 hash, and the analysis can be loaded/cloned using the cacher package
- when analysis is cloned, local directories are created, source files/metadata are downloaded
- data objects are NOT downloaded by default
- references to data objects are loaded and corresponding data can be lazy-loaded on demand
- when user examines it, then it is downloaded/calculated
- readers are free to examine and evaluate the code
- clonecache(id = "####") = loads an analysis from the cache
- using the first 4 characters of the hash is generally enough to identify the analysis
- showfiles() = lists the R scripts available in the cache
- sourcefile("name.R") = loads a cached R file
- code() = prints the content of the R file line by line
- graphcode() = plots a graph showing the dependencies/structure of the code
- objectcode("object") = shows the lines of code used to generate that specific object (tracing all the way back to reading the data)
- runcode() = executes the code, loading results from the cache database (much faster than running from scratch)
- by default, expressions that create objects are NOT run; the objects are loaded from the cache database
- expressions not resulting in objects ARE evaluated (e.g. plots)
- checkcode() = evaluates all expressions from scratch
- results of the evaluation are checked against stored results to see if they are the same
- setting the random number generator seed is critical for this to work
- checkobjects() = checks the integrity of data objects (i.e. detects possible data corruption)
- loadcache() = loads pointers to the data objects in the database
- when a specific object is called, the cached data is transferred and downloaded
- example
library(cacher)
clonecache(id = "092dcc7dda4b93e42f23e038a60e1d44dbec7b3f")
clonecache(id = "092d") ## effectively the same as above
# output: created cache directory '.cache'
showfiles() # show files stored in cache
# output: [1] "top20.R"
sourcefile("top20.R") # load R script
code() # examine the content of the code
# output:
# source file: top20.R
# 1 cities <- readLines("citylist.txt")
# 2 classes <- readLines("colClasses.txt")
# 3 vars <- c("date", "dow", "death",
# 4 data <- lapply(cities, function(city) {
# 5 names(data) <- cities
# 6 estimates <- sapply(data, function(city) {
# 7 effect <- weighted.mean(estimates[1,
# 8 stderr <- sqrt(1/sum(1/estimates[2,
graphcode() # generate graph showing structure of code

objectcode("data")
# output:
# source file: top20.R
# 1 cities <- readLines("citylist.txt")
# 2 classes <- readLines("colClasses.txt")
# 3 vars <- c("date", "dow", "death", "tmpd", "rmtmpd", "dptp", "rmdptp", "l1pm10tmean")
# 4 data <- lapply(cities, function(city) {
# filename <- file.path("data", paste(city, "csv", sep = "."))
# d0 <- read.csv(filename, colClasses = classes, nrow = 5200)
# d0[, vars]
# })
# 5 names(data) <- cities
loadcache()
ls()
# output:
# [1] "cities" "classes" "data" "effect"
# [5] "estimates" "stderr" "vars"
cities
# output:
# / transferring cache db file b8fd490bcf1d48cd06...
# [1] "la" "ny" "chic" "dlft" "hous" "phoe"
# [7] "staa" "sand" "miam" "det" "seat" "sanb"
# [13] "sanj" "minn" "rive" "phil" "atla" "oakl"
# [19] "denv" "clev"
effect
# output:
# / transferring cache db file 584115c69e5e2a4ae5...
# [1] 0.0002313219
stderr
# output:
# / transferring cache db file 81b6dc23736f3d72c6...
# [1] 0.000052457
Case Study: Air Pollution
- reanalysis of data from NMMAPS, linked with PM chemical constituent data
- Lippmann et al. found strong evidence that Ni modified the short-term effect of \(PM_{10}\) across 60 US communities
- National Morbidity, Mortality, and Air Pollution Study (NMMAPS) = national study of the short-term health effects of ambient air pollution
- focused primarily on particulate matter (\(PM_{10}\)) and ozone (\(O_3\))
- health outcomes included mortality from all causes and hospitalizations for cardiovascular and respiratory diseases
- Data made available at the Internet-based Health and Air Pollution Surveillance System (iHAPSS)

Analysis: Does Nickel Make PM Toxic?
- Long-term average nickel concentrations appear correlated with PM risk (\(p < 0.01\) \(\rightarrow\) statistically significant)
- there appear to be some outliers on the right-hand side (New York City)
- adjusting the data by removing the New York counties altered the regression line (shown in blue)
- though the relationship is still positive, the regression line is no longer statistically significant (\(p = 0.31\))

- sensitivity analysis shows that the regression line is particularly sensitive to the New York data

Conclusions and Lessons Learnt
- New York has very high levels of nickel and vanadium, much higher than any other US community
- there is evidence of a positive relationship between Ni concentrations and \(PM_{10}\) risk
- strength of this relationship is highly sensitive to the observations from New York City
- most of the information in the data is derived from just 3 observations
- reproducibility of NMMAPS allowed for a secondary analysis (and linking with PM chemical constituent data) investigating a novel hypothesis (Lippmann et al.), as well as a critique of that analysis and new findings (Dominici et al.)
- original hypothesis not necessarily invalidated, but evidence not as strong as originally suggested (more work should be done)
Case Study: High Throughput Biology

- Potti et al. published in 2006 that microarray data from cell lines (the NCI60) can be used to define drug response "signatures", which could be used to predict whether patients will respond to treatment
- however, the analyses performed were fundamentally flawed: there was extensive misclassification and mishandling of data, and the results were not reproducible
- the results of their study, even after being corrected twice, still contained many errors and were unfortunately used as guidelines for clinical trials
- the fiasco with this paper and the associated research, though spectacular in its own right, was by no means a unique occurrence
Conclusions and Lessons Learnt
- most common mistakes are simple
- confounding in the experimental design
- mixing up the sample labels
- mixing up the gene labels
- mixing up the group labels
- most mixups involve simple switches or offsets
- this simplicity is often hidden due to incomplete documentation
- research papers should include the following (particularly those that would potentially lead to clinical trials)
- data (with column names)
- provenance (who owns what data/which sample is which)
- code
- descriptions of non-scriptable steps
- descriptions of planned design (if used)
- reproducible research is the key to minimize these errors
- literate programming
- reusing templates
- report structure
- executive summaries
- appendices (sessionInfo output, saved objects, file locations)
Case Study
- Pokemon: classify fire vs. not fire, or rare vs. not rare
- Plankton: classify areas of high plankton growth vs. areas without high plankton growth
# simulate a small Pokemon dataset
level <- sample(1:99, 10, replace = TRUE)
hp <- sample(11:50, 10, replace = TRUE)
atk <- rnorm(10, 10, 4)
data.frame(level, hp, atk)
# pick 6 random Pokemon out of the original 151
sample(1:151, 6)