Overview

Downloading Files

Reading Excel Files

Reading XML Files

<pokedex> 
  <party> 
    <name>Squirtle</name>
    <level>10</level>
    <type>Water</type>
    <move number="1">tackle</move>
    <move number="2">bubble</move>
  </party>
  <party>
    <name>Charmander</name>
    <level>10</level>
    <type>Fire</type>
    <move number="1">scratch</move> <!-- 'move' tag with 'number' attribute -->
    <move number="2">ember</move>
  </party>
</pokedex>
library(XML)
doc <- xmlTreeParse("pokedex.xml", useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)
xmlName(rootNode)
## [1] "pokedex"
names(rootNode) # two pokemon elements are tagged by 'party'
##   party   party 
## "party" "party"
rootNode[[1]] # the first element
## <party>
##   <name>Squirtle</name>
##   <level>10</level>
##   <type>Water</type>
##   <move number="1">tackle</move>
##   <move number="2">bubble</move>
## </party>
rootNode[[1]][[1]]
## <name>Squirtle</name>
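
Indexing with `[[ ]]` works for a small tree, but XPath queries scale better. The following is a minimal sketch, assuming the XML package is installed; it parses a shortened inline copy of the sample document (the `level` and `type` tags are left out) instead of reading `pokedex.xml`, and the variable names are only illustrative.

```r
library(XML)

# Shortened inline copy of the sample document, so no file on disk is needed.
xml_text <- '<pokedex>
  <party>
    <name>Squirtle</name>
    <move number="1">tackle</move>
    <move number="2">bubble</move>
  </party>
  <party>
    <name>Charmander</name>
    <move number="1">scratch</move>
    <move number="2">ember</move>
  </party>
</pokedex>'

doc <- xmlParse(xml_text, asText = TRUE)

# Text of every <name> node anywhere in the tree
pokemon_names <- xpathSApply(doc, "//name", xmlValue)

# Value of the 'number' attribute on every <move> node
move_numbers <- xpathSApply(doc, "//move", xmlGetAttr, "number")

# Only the moves whose 'number' attribute equals "1"
first_moves <- xpathSApply(doc, "//move[@number='1']", xmlValue)

print(pokemon_names)
print(first_moves)
```

`xmlValue` returns the text inside a node, while `xmlGetAttr` returns an attribute, which is why the two queries on `<move>` give different kinds of results.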

Reading JSON

Reading data from the web / web scraping [tutorial](https://drive.google.com/file/d/0B1YdO4YnMkAxd3dUclRLUWs2eEU/view?usp=embed_facebook)

Reading APIs

fruit <- c("apple", "pear", "mango") # character vector.
price <- c(30, 30, 90) # numeric vector
price
## [1] 30 30 90

fileUrl <- "http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_National_Pok%C3%A9dex_number"
doc <- htmlTreeParse(fileUrl, useInternalNodes = TRUE)
rootNode <- xmlRoot(doc)

scores <- xpathSApply(doc, "//td", xmlValue)
head(scores)
scores[[5]]
sapply(scores, FUN = function(x) x[3])
scoresdf <- data.frame(scores)
sb <- scoresdf[3:754,]
sb <- data.frame(sb)
names(scores)

tb <- readHTMLTable(fileUrl)
head(tb)
head(tb[[2]])
dt_1 <- tb[[2]]
head(tb[[2]][[5]])
table(tb[[2]][[5]])

fileUrl <- "http://bulbapedia.bulbagarden.net/wiki/List_of_Pok%C3%A9mon_by_base_stats_(Generation_I)"
tb2 <- readHTMLTable(fileUrl)
head(tb2)
dt_2 <- tb2[1]
dt_2 <- data.frame(dt_2)
colnames(dt_1)[4] <- c("name")
colnames(dt_2)[3] <- c("name")

merged <- merge(dt_1, dt_2, by = "name") # merge on the 'name' column that both tables now share

library(plyr)
joined <- join(dt_2, dt_1, by = "name") # join() expects data frames, not single columns

Subsetting data

Sorting data (tutorial):

country <- c("china", "afghanistan")
food <- c("apple", "orange")
df <- data.frame("V1"= rep(country, 2), "V2"= rep(food, each = 2))
df
##            V1     V2
## 1       china  apple
## 2 afghanistan  apple
## 3       china orange
## 4 afghanistan orange
merge(df[1:2,], df[3:4,], by="V1")
##            V1  V2.x   V2.y
## 1 afghanistan apple orange
## 2       china apple orange
df <- data.frame("country"= rep(country, 2), "food"= rep(food, each = 2), "tonnes" = sample(1:10, 4))
df
##       country   food tonnes
## 1       china  apple      1
## 2 afghanistan  apple      9
## 3       china orange     10
## 4 afghanistan orange      8
library(reshape2)
dcast(df[,1:3], formula = country ~ food, value.var = "tonnes")
##       country apple orange
## 1 afghanistan     9      8
## 2       china     1     10
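
The `merge()` call above joins on a column that has the same name in both inputs. When the key columns are named differently, `by.x` and `by.y` name them separately. The sketch below uses only base R; the two small data frames and all of their values are made up for illustration.

```r
# Hypothetical lookup tables; the key is called 'name' in one and 'pokemon' in the other.
stats <- data.frame(
  id   = c(1, 4, 7),
  name = c("Bulbasaur", "Charmander", "Squirtle"),
  stringsAsFactors = FALSE
)
types <- data.frame(
  pokemon = c("Squirtle", "Charmander", "Pikachu"),
  type    = c("Water", "Fire", "Electric"),
  stringsAsFactors = FALSE
)

# Inner join: only keys present in both inputs survive
inner <- merge(stats, types, by.x = "name", by.y = "pokemon")

# Left join: keep every row of 'stats', filling unmatched columns with NA
left <- merge(stats, types, by.x = "name", by.y = "pokemon", all.x = TRUE)

print(inner)
print(left)
```

The key column in the result takes its name from `by.x`, and rows are sorted by the key unless `sort = FALSE` is passed.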

Summarising data

|        | Fire | Water |
|--------|------|-------|
| Male   | 1198 | 1493  |
| Female | 557  | 1278  |
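
A two-way table like this can be rebuilt and summarised in base R. This is a sketch only: the counts are the ones shown above, but the column names `gender`, `type` and `n` are assumptions, since the notes do not say which variables the table was built from.

```r
# Counts copied from the table; the variable names are assumed for illustration.
counts <- data.frame(
  gender = c("Male", "Male", "Female", "Female"),
  type   = c("Fire", "Water", "Fire", "Water"),
  n      = c(1198, 1493, 557, 1278)
)

# Cross-tabulate the counts by both variables
tab <- xtabs(n ~ gender + type, data = counts)

# Add row and column totals
with_totals <- addmargins(tab)

# Share of each type within each gender (every row sums to 1)
row_shares <- prop.table(tab, margin = 1)

print(with_totals)
print(round(row_shares, 3))
```

With raw, one-row-per-observation data, `table(df$gender, df$type)` would produce the same kind of object without a count column.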

Creating new variables

Reshaping data

##       name trainer level     type
## 1  Pikachu     Ash     5 electric
## 2   Staryu   Misty    20    water
## 3 Caterpie     Ash    14    grass
##       name trainer variable    value
## 1  Pikachu     Ash    level        5
## 2   Staryu   Misty    level       20
## 3 Caterpie     Ash    level       14
## 4  Pikachu     Ash     type electric
## 5   Staryu   Misty     type    water
## 6 Caterpie     Ash     type    grass
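
The two printouts above look like a small data frame and its melted (long) form. The code that produced them is not shown, so the following is a reconstruction under assumptions: it uses `melt()` from reshape2 with `name` and `trainer` as id variables, and reuses the names `df` and `dfmelt` that the later `dcast()` calls expect.

```r
library(reshape2)

# Values copied from the first printout
df <- data.frame(
  name    = c("Pikachu", "Staryu", "Caterpie"),
  trainer = c("Ash", "Misty", "Ash"),
  level   = c(5, 20, 14),
  type    = c("electric", "water", "grass"),
  stringsAsFactors = FALSE
)

# Wide to long: every non-id column becomes (variable, value) rows
dfmelt <- melt(df, id.vars = c("name", "trainer"))

print(df)
print(dfmelt)
```

Because `level` is numeric and `type` is text, the combined `value` column is stored as character.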
dcast(dfmelt, trainer~variable)
## Aggregation function missing: defaulting to length
##   trainer level type
## 1     Ash     2    2
## 2   Misty     1    1
dcast(df, name~trainer, value.var = "level", mean)
##       name Ash Misty
## 1 Caterpie  14   NaN
## 2  Pikachu   5   NaN
## 3   Staryu NaN    20
##   trainer mean
## 1     Ash  9.5
## 2   Misty 20.0
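
The last printout looks like the mean level per trainer. The call that produced it is not shown; one way to get the same table, assuming plyr and the three-row data frame printed earlier in this section, is sketched here.

```r
library(plyr)

# Same three rows as the printout at the start of this section
df <- data.frame(
  name    = c("Pikachu", "Staryu", "Caterpie"),
  trainer = c("Ash", "Misty", "Ash"),
  level   = c(5, 20, 14),
  stringsAsFactors = FALSE
)

# Split by trainer, compute one summary value per group, recombine
trainer_means <- ddply(df, .(trainer), summarize, mean = mean(level))

print(trainer_means)
```

In base R, `aggregate(level ~ trainer, data = df, FUN = mean)` gives the same numbers, with the result column named `level` instead of `mean`.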
* `ddply(iris, .(Species), numcolwise(mean))` = applies `mean` to every numeric column within each group. The result is a data frame with one row per group and the mean of each numeric column.
    * `.(Species)` = the variable to group the data by.
* `spraySums <- ddply(InsectSprays, .(spray), summarize, sum = ave(count, FUN = sum))` = creates a two-column data frame in which every row holds a spray and that spray's total count. Because `ave` returns one value per input row, the total is repeated for each row of its group.
* The result can then be added back to the dataset for analysis.
* Other useful plyr functions:
    * `arrange` = fast reordering without `order()`.
    * `mutate` = adds a new variable, e.g. appends summarised data to the dataset.
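
The difference between these plyr helpers is easiest to see side by side. The sketch below assumes plyr is installed and uses the built-in `InsectSprays` data; the result column is called `total` here (not `sum`, as in the note above) purely to avoid clashing with the function name.

```r
library(plyr)

# summarize with a plain aggregate: one row per spray
spray_totals <- ddply(InsectSprays, .(spray), summarize, total = sum(count))

# summarize with ave(): one row per original observation, total repeated within each group
spray_ave <- ddply(InsectSprays, .(spray), summarize, total = ave(count, FUN = sum))

# mutate: keeps all original columns and appends the group total
with_totals <- ddply(InsectSprays, .(spray), mutate, total = sum(count))

# arrange: reorder rows without calling order() directly
ranked <- arrange(spray_totals, desc(total))

print(ranked)
print(head(with_totals))
```

`mutate` is usually the more direct choice when the goal is to add a group summary back onto the full dataset.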

Managing data with dplyr

tidyr

lubridate

Cleaning data

# Replace apples with ones and oranges with twos. 
df <- data.frame("country"= rep(country, 2), "food"= rep(food, each = 2), "tonnes" = sample(1:10, 4))
df$food <- factor(df$food, labels = c(1, 2)) # levels sort alphabetically: apple -> 1, orange -> 2
df
##       country food tonnes
## 1       china    1      7
## 2 afghanistan    1      9
## 3       china    2      2
## 4 afghanistan    2      3
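
Relabelling a factor leaves `food` as a factor whose labels happen to look like numbers. If real numeric codes are wanted, a named lookup vector is one alternative. This is a base R sketch; the names `codes` and `food_code` are made up for illustration.

```r
country <- c("china", "afghanistan")
food <- c("apple", "orange")
df <- data.frame(
  country = rep(country, 2),
  food    = rep(food, each = 2),
  stringsAsFactors = FALSE
)

# Named vector used as a lookup table: value -> numeric code
codes <- c(apple = 1, orange = 2)
df$food_code <- unname(codes[df$food])

# A factor with numeric-looking labels must go through character first;
# as.numeric() on the factor itself would return the internal level indices.
f <- factor(df$food, levels = c("apple", "orange"), labels = c(1, 2))
from_factor <- as.numeric(as.character(f))

print(df)
```

The lookup approach also makes the mapping explicit, instead of relying on the alphabetical order of the factor levels.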

Editing Text Variables

Regular Expression

Would the real Slim Shady please stand up, please stand up, please stand up.

I said would the real Slim Shady please stand up, please stand up.

Now I'm the real Slim Shady...
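
The three lines above can serve as test input for base R's regular expression functions. A minimal sketch follows; the vector name `lyrics` and the particular patterns are chosen only for illustration.

```r
lyrics <- c(
  "Would the real Slim Shady please stand up, please stand up, please stand up.",
  "I said would the real Slim Shady please stand up, please stand up.",
  "Now I'm the real Slim Shady..."
)

# ^ anchors the pattern to the start of the string
starts_with_i_said <- grepl("^I said", lyrics)

# \\. matches a literal dot, {3} repeats it three times, $ anchors to the end
ends_with_ellipsis <- grepl("\\.{3}$", lyrics)

# Count how many times a phrase occurs in each line
stand_up_counts <- lengths(regmatches(lyrics, gregexpr("please stand up", lyrics)))

# Replace the first match in each line
renamed <- sub("Slim Shady", "Marshall", lyrics)

print(starts_with_i_said)
print(stand_up_counts)
print(renamed[3])
```

`sub()` replaces only the first match per string; `gsub()` replaces all of them.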

Working with dates

Data Sources