Basics

value assignment

Some languages like C requires to specify the type of the variable first and then assign numbers, e.g. int abc; abc = 2;. But R does not require such thing. R automatically generates the variable and changes the variable type accordingly.

# the basic value assignment operator that always works
abc <- 2 # in this line, R creates the variable "abc" and assigns 2
abc
## [1] 2
# the easy-to-use operator that rarely, but not never, produces error
abc = 2
abc
## [1] 2
# you can also assign arrays
abc = c(1,2,3,4,5)
abc
## [1] 1 2 3 4 5
# assigning the integer sequence
abc = 1:5
abc
## [1] 1 2 3 4 5
abc = 3:(-2)
abc
## [1]  3  2  1  0 -1 -2
# you can use the function "rep" to create repeating vectors
abc = rep(2, 5)
abc
## [1] 2 2 2 2 2
abc = rep(c(1,2,3), 5)
abc
##  [1] 1 2 3 1 2 3 1 2 3 1 2 3 1 2 3
# you can use the function "seq" to assign an equispaced sequence
abc = seq(from=0, to=1, by=0.1)
abc
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0
abc = seq(from=0, to=1, length=11)
abc
##  [1] 0.0 0.1 0.2 0.3 0.4 0.5 0.6 0.7 0.8 0.9 1.0

writing functions

You can write the custom functions in R easily. The functions do not need to be written in separate files while some languages like Matlab requires to do so.

# to write a custom function, assign FUNCTION to a VARIABLE
return(x+y)
}
square = function(x) {
return(x^2)
}

# the return statement can be omitted. R returns what is written in the last line.
square_noReturnStatement = function(x) {
x^2
}

# to call a function, write the name and put arguments inside the bracket
abc = 3
add(x=abc, y=2)
## [1] 5
square(x=3)
## [1] 9
square_noReturnStatement(x=3)
## [1] 9
# you can omit the labels
add(abc,2)
## [1] 5
# of course you can assign the returned value to a variable
result = square(abc)
result
## [1] 9

By the way, now we can have better understanding of what we mean by R automatically changing the variable types:

# right now, a function is assigned in the variable "add":
add
## function(x,y) {
##   return(x+y)
## }
## <bytecode: 0x5622c44c77b0>
# we know it works as a function
add(abc, 2)
## [1] 5
# let's assign a number to "add":
add
## [1] 3
# then the variable "add" is no longer a function:
add(abc, 2)
## Error in add(abc, 2): could not find function "add"

local and parent environment

Consider the following code.

myFunc = function(x) {
xSquare = x^2
return(xSquare/3)
}

myFunc(x = 2)
## [1] 1.333333
xSquare
## Error in eval(expr, envir, enclos): object 'xSquare' not found

We see that xSquare is not defined, even though it should have been used in myFunc. What happens in R, and in many other programming languages, is the following. When we call myFunc, R creates a sub-environment (i.e. a storage space) and define xSquare there. Then, when it returns xSquare/3 and escape myFunc, it deletes the sub-environment. Therefore, xSquare is not stored anymore.

The sub-environment is like a child of the original environment, and so we call the original one as the parent environment in the perspective of sub-environment. The very top environment is called the global environment.

We call the variables in the current environment as “local variables”.

Now consider the following code.

myFunc = function(x) {
xSquare = x^2
return(xSquare/3)
}

myFunc2 = function(x) {
xInverse = 1/x
resultMyFunc = myFunc(xInverse)
return(resultMyFunc)
}

myFunc3 = function(x) {
xCube = x^3
resultMyFunc2 = myFunc2(xCube)
return(resultMyFunc2)
}

myFunc3(x = 2)
## [1] 0.005208333

When we call myFunc3 in the last line, R creates a sub-environment for myFunc3. Then, when myFunc3 calls myFunc2 within its code, R creates an environment for myFunc2 that is a sub-environment of myFunc3. Similarly, when myFunc2 calls myFunc, R creates another sub-environment.

When the function creates the sub-environment, it stores the arguments of the function in the environment. In the above example, every environment has x variable, although they will all have different values.

How much the sub-environment can access variables in the parent environment is different across languages. In R, the functions can read and change values of the variables in the parent environments.

myFuncValue = 0

myFunc5 = function() {

print(myFuncValue)
myFuncValue = 2

myFunc4 = function() {

print(myFuncValue)
myFuncValue = 1
print(myFuncValue)
}

myFunc4()
print(myFuncValue)
}

myFunc5()
## [1] 0
## [1] 2
## [1] 1
## [1] 2
print(myFuncValue)
## [1] 0

In the above code, myFuncValue is present in all of the environments. The function always first search for the local variable when it needs to access the variable.

Now let’s see what happens if we define myFunc4 not in the environment of myFunc5 but in the global environment.

myFuncValue = 0

myFunc4 = function() {

print(myFuncValue)
myFuncValue = 1
print(myFuncValue)
}

myFunc5 = function() {

print(myFuncValue)
myFuncValue = 2

myFunc4()
print(myFuncValue)
}

myFunc5()
## [1] 0
## [1] 0
## [1] 1
## [1] 2
print(myFuncValue)
## [1] 0

In R, we can also change the value of the variable in the parent environment using <<-.

# let's add lines that orders the functions to print variables of parent environments:

xx = 2

myFunc6 = function(x) {
xSquare = x^2
myFunc7()
return(xSquare)
}

myFunc7 = function() {
xx <<- 0
}

myFunc6(x = xx)
## [1] 4
xx
## [1] 0

However, accessing variables of the parent environment is not recommended, as it is very likely to produce errors and wrong results. If you need to access variables of the parent environment, always pass it to sub-environment by putting them as arguments.

a = 2
b = 3

myFunc8 = function(x) { # not recommended!
return(x + a + b)
}

myFunc9 = function(x,a,b) {
return(x + a + b) # as mentioned before, the function will use the "local" a and b.
}

matrix, array and data.frame

There may be other ways, but one way to create a matrix is the following.

mat = matrix(0, nrow=3, ncol=2)
mat
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## [3,]    0    0
sampleArray = c(1,2,3,4,5,6)

mat = matrix(sampleArray, nrow=3, ncol=2)
mat
##      [,1] [,2]
## [1,]    1    4
## [2,]    2    5
## [3,]    3    6
mat = matrix(sampleArray, nrow=3, ncol=2, byrow = TRUE)
mat
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
## [3,]    5    6
# you can access the entries in a usual way.
# In some languages such as C, the first entry is labeled 0.
# In R, the first entry is 1.
mat[1,1]
## [1] 1
mat[3,2]
## [1] 6
mat[0,1]
## numeric(0)
# you can also access rows or columns
mat[3,]
## [1] 5 6
mat[,2]
## [1] 2 4 6
# you can also call the submatrices
mat[c(1,2),]
##      [,1] [,2]
## [1,]    1    2
## [2,]    3    4
# a useful trick
mat[c(1,3,3,3,2,1,3),]
##      [,1] [,2]
## [1,]    1    2
## [2,]    5    6
## [3,]    5    6
## [4,]    5    6
## [5,]    3    4
## [6,]    1    2
## [7,]    5    6

The array function with 2 dimensions also returns a matrix. It is also useful in producing 3 dimensional arrays.

arr = array(0, dim = c(3,2))
arr
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## [3,]    0    0
arr = array(0, dim = c(3,2,4))
arr
## , , 1
##
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## [3,]    0    0
##
## , , 2
##
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## [3,]    0    0
##
## , , 3
##
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## [3,]    0    0
##
## , , 4
##
##      [,1] [,2]
## [1,]    0    0
## [2,]    0    0
## [3,]    0    0

The data.frame is a variable type that handles the datasets.

ctrs = c("U.S.", "Canada", "Mexico")
pops = c(318.9, 35.16, 122.3)

# you create a data frame as follows:
df = data.frame(Country=ctrs, Population=pops)
df
##   Country Population
## 1    U.S.     318.90
## 3  Mexico     122.30

The data.frame is a matrix with the data manipulation capabilities.

# you can access the entries like the matrix
df[3,2]
## [1] 122.3
# you can also access a column by its name. Use money symbol.
df$Population ## [1] 318.90 35.16 122.30 # you can also subset the dataset subDF = subset(df, Population >= 100) subDF ## Country Population ## 1 U.S. 318.9 ## 3 Mexico 122.3 subDF = subset(df, Population >= 100, select = c("Country")) subDF ## Country ## 1 U.S. ## 3 Mexico data.table We may use data.table intead of data.frame in the class. It is an enhancement of data.frame which is created by contributors of R community. data.table is not automatically installed with base R. We need to install a “package” for data.table, which is basically a collection of codes. We install the package in our computer by the following code. R uses internet and downloads the package from the R package database. # the name of the package that allows us to use data.table is "data.table". install.packages("data.table") Now we have installed the package. As we do when we install and use computer programs like MS Office, We need to “execute” or “load” the package to use it. We load the package by the following code: library(data.table) Now we discuss how data.frame works. It works like data.frame. # recall: ctrs ## [1] "U.S." "Canada" "Mexico" pops ## [1] 318.90 35.16 122.30 # create datatable dt = data.table(Country=ctrs, Population=pops) dt ## Country Population ## 1: U.S. 318.90 ## 2: Canada 35.16 ## 3: Mexico 122.30 We can also transform data.frame into data.table. # recall: df ## Country Population ## 1 U.S. 318.90 ## 2 Canada 35.16 ## 3 Mexico 122.30 class(df) ## [1] "data.frame" # transform. dt = as.data.table(df) dt ## Country Population ## 1: U.S. 318.90 ## 2: Canada 35.16 ## 3: Mexico 122.30 class(dt) ## [1] "data.table" "data.frame" data.table has additional features compared to data.frame. For example, to subset a data.table, do the following. # subset rows with pop >= 100 dt[Population >= 100] ## Country Population ## 1: U.S. 318.9 ## 2: Mexico 122.3 # subset rows but choose the country column only. dt[Population >= 100, Country] ## [1] U.S. Mexico ## Levels: Canada Mexico U.S. # if you want to be still in the data.table format: dt[Population >= 100, list(Country)] ## Country ## 1: U.S. ## 2: Mexico class(dt[Population >= 100, list(Country)]) ## [1] "data.table" "data.frame" We will see other features of data.table soon. For more information, look at https://github.com/Rdatatable/data.table/wiki/Getting-started, which is a webpage created by the author of data.table package. list list is a variable type that can store various sub-variables. It is like struct in C. abc = list(a=3, b="John Doe", c=123.456) # to access the sub-variables, call them using the money symbol: abc$a
## [1] 3
abc$b ## [1] "John Doe" abc$c
## [1] 123.456

Inside the list, you can save everything as a sub-variable. You can save functions, you can save matrices, you can save a list inside a list, and so on. The following example stores various model primitives of a matching model in Labor economics.

# model primitives
model = list(
r = 0.05, # interest rate
s = 0.02, # separation rate
matchFunc = function(u,v) { 0.4 * u^0.5 * v^0.5 }, # matching function
b = 0, # unemployment benefit
c = 0, # cost of vacancy
alpha = 0.5, # bargaining power
prodFunc = function(x,y) { (x^(-2) + y^(-2))^(1/(-2)) + 0.2 } # production function
)

We can call a function inside the list in a usual way:

model$prodFunc(1,2) ## [1] 1.094427 Instead of assigning names to the sub-variables, we can leave the names blank, in which case the serial numbers are assigned. abc = list(3, "John Doe", 123.456) abc ## [[1]] ## [1] 3 ## ## [[2]] ## [1] "John Doe" ## ## [[3]] ## [1] 123.456 # to access the elements of the list, use the double brackets: abc[[1]] ## [1] 3 abc[[2]] ## [1] "John Doe" abc[[3]] ## [1] 123.456 syntax If you are familiar with syntax of other languages such as matlab, stata, python, etc., the syntax is very much similar in R. For example, the for loop looks like this: # the canonical for loop with natural numbers sum = 0 for(i in 1:4) { sum = sum + i } sum ## [1] 10 # the loop works for ANY array greekLetterArray = c("Alpha", "Beta", "Gamma", "Delta") for(greekLetter in greekLetterArray) { print(greekLetter) } ## [1] "Alpha" ## [1] "Beta" ## [1] "Gamma" ## [1] "Delta" I will not spend time on the syntax, but you can use google to learn how to write a certain syntax, or you can come to the office hour. The plyr package The plyr package is a package for data manipulation. Let’s look at how the functions in the plyr package are used by example. First, let’s take a dataset from the MASS package. The MASS package is pre-installed in R, and so we can simply load it without installing it. library(MASS) # the MASS package contains a dataset named "Insurance" which is a dataset about the insurance claims. Insurance ## District Group Age Holders Claims ## 1 1 <1l <25 197 38 ## 2 1 <1l 25-29 264 35 ## 3 1 <1l 30-35 246 20 ## 4 1 <1l >35 1680 156 ## 5 1 1-1.5l <25 284 63 ## 6 1 1-1.5l 25-29 536 84 ## 7 1 1-1.5l 30-35 696 89 ## 8 1 1-1.5l >35 3582 400 ## 9 1 1.5-2l <25 133 19 ## 10 1 1.5-2l 25-29 286 52 ## 11 1 1.5-2l 30-35 355 74 ## 12 1 1.5-2l >35 1640 233 ## 13 1 >2l <25 24 4 ## 14 1 >2l 25-29 71 18 ## 15 1 >2l 30-35 99 19 ## 16 1 >2l >35 452 77 ## 17 2 <1l <25 85 22 ## 18 2 <1l 25-29 139 19 ## 19 2 <1l 30-35 151 22 ## 20 2 <1l >35 931 87 ## 21 2 1-1.5l <25 149 25 ## 22 2 1-1.5l 25-29 313 51 ## 23 2 1-1.5l 30-35 419 49 ## 24 2 1-1.5l >35 2443 290 ## 25 2 1.5-2l <25 66 14 ## 26 2 1.5-2l 25-29 175 46 ## 27 2 1.5-2l 30-35 221 39 ## 28 2 1.5-2l >35 1110 143 ## 29 2 >2l <25 9 4 ## 30 2 >2l 25-29 48 15 ## 31 2 >2l 30-35 72 12 ## 32 2 >2l >35 322 53 ## 33 3 <1l <25 35 5 ## 34 3 <1l 25-29 73 11 ## 35 3 <1l 30-35 89 10 ## 36 3 <1l >35 648 67 ## 37 3 1-1.5l <25 53 10 ## 38 3 1-1.5l 25-29 155 24 ## 39 3 1-1.5l 30-35 240 37 ## 40 3 1-1.5l >35 1635 187 ## 41 3 1.5-2l <25 24 8 ## 42 3 1.5-2l 25-29 78 19 ## 43 3 1.5-2l 30-35 121 24 ## 44 3 1.5-2l >35 692 101 ## 45 3 >2l <25 7 3 ## 46 3 >2l 25-29 29 2 ## 47 3 >2l 30-35 43 8 ## 48 3 >2l >35 245 37 ## 49 4 <1l <25 20 2 ## 50 4 <1l 25-29 33 5 ## 51 4 <1l 30-35 40 4 ## 52 4 <1l >35 316 36 ## 53 4 1-1.5l <25 31 7 ## 54 4 1-1.5l 25-29 81 10 ## 55 4 1-1.5l 30-35 122 22 ## 56 4 1-1.5l >35 724 102 ## 57 4 1.5-2l <25 18 5 ## 58 4 1.5-2l 25-29 39 7 ## 59 4 1.5-2l 30-35 68 16 ## 60 4 1.5-2l >35 344 63 ## 61 4 >2l <25 3 0 ## 62 4 >2l 25-29 16 6 ## 63 4 >2l 30-35 25 8 ## 64 4 >2l >35 114 33 # of course we can also use data.table dtInsurance = as.data.table(Insurance) dtInsurance ## District Group Age Holders Claims ## 1: 1 <1l <25 197 38 ## 2: 1 <1l 25-29 264 35 ## 3: 1 <1l 30-35 246 20 ## 4: 1 <1l >35 1680 156 ## 5: 1 1-1.5l <25 284 63 ## 6: 1 1-1.5l 25-29 536 84 ## 7: 1 1-1.5l 30-35 696 89 ## 8: 1 1-1.5l >35 3582 400 ## 9: 1 1.5-2l <25 133 19 ## 10: 1 1.5-2l 25-29 286 52 ## 11: 1 1.5-2l 30-35 355 74 ## 12: 1 1.5-2l >35 1640 233 ## 13: 1 >2l <25 24 4 ## 14: 1 >2l 25-29 71 18 ## 15: 1 >2l 30-35 99 19 ## 16: 1 >2l >35 452 77 ## 17: 2 <1l <25 85 22 ## 18: 2 <1l 25-29 139 19 ## 19: 2 <1l 30-35 151 22 ## 20: 2 <1l >35 931 87 ## 21: 2 1-1.5l <25 149 25 ## 22: 2 1-1.5l 25-29 313 51 ## 23: 2 1-1.5l 30-35 419 49 ## 24: 2 1-1.5l >35 2443 290 ## 25: 2 1.5-2l <25 66 14 ## 26: 2 1.5-2l 25-29 175 46 ## 27: 2 1.5-2l 30-35 221 39 ## 28: 2 1.5-2l >35 1110 143 ## 29: 2 >2l <25 9 4 ## 30: 2 >2l 25-29 48 15 ## 31: 2 >2l 30-35 72 12 ## 32: 2 >2l >35 322 53 ## 33: 3 <1l <25 35 5 ## 34: 3 <1l 25-29 73 11 ## 35: 3 <1l 30-35 89 10 ## 36: 3 <1l >35 648 67 ## 37: 3 1-1.5l <25 53 10 ## 38: 3 1-1.5l 25-29 155 24 ## 39: 3 1-1.5l 30-35 240 37 ## 40: 3 1-1.5l >35 1635 187 ## 41: 3 1.5-2l <25 24 8 ## 42: 3 1.5-2l 25-29 78 19 ## 43: 3 1.5-2l 30-35 121 24 ## 44: 3 1.5-2l >35 692 101 ## 45: 3 >2l <25 7 3 ## 46: 3 >2l 25-29 29 2 ## 47: 3 >2l 30-35 43 8 ## 48: 3 >2l >35 245 37 ## 49: 4 <1l <25 20 2 ## 50: 4 <1l 25-29 33 5 ## 51: 4 <1l 30-35 40 4 ## 52: 4 <1l >35 316 36 ## 53: 4 1-1.5l <25 31 7 ## 54: 4 1-1.5l 25-29 81 10 ## 55: 4 1-1.5l 30-35 122 22 ## 56: 4 1-1.5l >35 724 102 ## 57: 4 1.5-2l <25 18 5 ## 58: 4 1.5-2l 25-29 39 7 ## 59: 4 1.5-2l 30-35 68 16 ## 60: 4 1.5-2l >35 344 63 ## 61: 4 >2l <25 3 0 ## 62: 4 >2l 25-29 16 6 ## 63: 4 >2l 30-35 25 8 ## 64: 4 >2l >35 114 33 ## District Group Age Holders Claims # some useful functions for looking at the dataset head(Insurance, n=5) # you can omit n, in which case it is set to 6. ## District Group Age Holders Claims ## 1 1 <1l <25 197 38 ## 2 1 <1l 25-29 264 35 ## 3 1 <1l 30-35 246 20 ## 4 1 <1l >35 1680 156 ## 5 1 1-1.5l <25 284 63 tail(Insurance) # ditto ## District Group Age Holders Claims ## 59 4 1.5-2l 30-35 68 16 ## 60 4 1.5-2l >35 344 63 ## 61 4 >2l <25 3 0 ## 62 4 >2l 25-29 16 6 ## 63 4 >2l 30-35 25 8 ## 64 4 >2l >35 114 33 str(Insurance) # displays structure of the data.frame ## 'data.frame': 64 obs. of 5 variables: ##$ District: Factor w/ 4 levels "1","2","3","4": 1 1 1 1 1 1 1 1 1 1 ...
##  $Group : Ord.factor w/ 4 levels "<1l"<"1-1.5l"<..: 1 1 1 1 2 2 2 2 3 3 ... ##$ Age     : Ord.factor w/ 4 levels "<25"<"25-29"<..: 1 2 3 4 1 2 3 4 1 2 ...
##  $Holders : int 197 264 246 1680 284 536 696 3582 133 286 ... ##$ Claims  : int  38 35 20 156 63 84 89 400 19 52 ...

To begin discussion, let’s install plyr.

install.packages("plyr")

summarizing data

The plyr package has useful functions that summarize information in data.frame. One useful function is ddply which can generate various summary statistics of a data.frame.

# let's load the package
library(plyr)

# this command computes average number of holders and claims for each district
ddply(.data = Insurance, .variables = .(District), .fun = summarize, meanHolders=mean(Holders), meanClaims=mean(Claims))
##   District meanHolders meanClaims
## 1        1    659.0625    86.3125
## 2        2    415.8125    55.6875
## 3        3    260.4375    34.5625
## 4        4    124.6250    20.3750
# this command computes total number of holders and claims for each age group
ddply(.data = Insurance, .variables = .(Age), .fun = summarize, totHolders=sum(Holders), totClaims=sum(Claims))
##     Age totHolders totClaims
## 1   <25       1138       229
## 2 25-29       2336       404
## 3 30-35       3007       453
## 4   >35      16878      2065
# we can also summarize numbers for each district+age group:
ddply(.data = Insurance, .variables = .(District, Age), .fun = summarize, totHolders=sum(Holders), totClaims=sum(Claims))
##    District   Age totHolders totClaims
## 1         1   <25        638       124
## 2         1 25-29       1157       189
## 3         1 30-35       1396       202
## 4         1   >35       7354       866
## 5         2   <25        309        65
## 6         2 25-29        675       131
## 7         2 30-35        863       122
## 8         2   >35       4806       573
## 9         3   <25        119        26
## 10        3 25-29        335        56
## 11        3 30-35        493        79
## 12        3   >35       3220       392
## 13        4   <25         72        14
## 14        4 25-29        169        28
## 15        4 30-35        255        50
## 16        4   >35       1498       234
# also see what happens when we put "mutate" instead of "summarize":
ddply(.data = Insurance, .variables = .(District, Age), .fun = mutate, totHolders=sum(Holders), totClaims=sum(Claims))
##    District  Group   Age Holders Claims totHolders totClaims
## 1         1    <1l   <25     197     38        638       124
## 2         1 1-1.5l   <25     284     63        638       124
## 3         1 1.5-2l   <25     133     19        638       124
## 4         1    >2l   <25      24      4        638       124
## 5         1    <1l 25-29     264     35       1157       189
## 6         1 1-1.5l 25-29     536     84       1157       189
## 7         1 1.5-2l 25-29     286     52       1157       189
## 8         1    >2l 25-29      71     18       1157       189
## 9         1    <1l 30-35     246     20       1396       202
## 10        1 1-1.5l 30-35     696     89       1396       202
## 11        1 1.5-2l 30-35     355     74       1396       202
## 12        1    >2l 30-35      99     19       1396       202
## 13        1    <1l   >35    1680    156       7354       866
## 14        1 1-1.5l   >35    3582    400       7354       866
## 15        1 1.5-2l   >35    1640    233       7354       866
## 16        1    >2l   >35     452     77       7354       866
## 17        2    <1l   <25      85     22        309        65
## 18        2 1-1.5l   <25     149     25        309        65
## 19        2 1.5-2l   <25      66     14        309        65
## 20        2    >2l   <25       9      4        309        65
## 21        2    <1l 25-29     139     19        675       131
## 22        2 1-1.5l 25-29     313     51        675       131
## 23        2 1.5-2l 25-29     175     46        675       131
## 24        2    >2l 25-29      48     15        675       131
## 25        2    <1l 30-35     151     22        863       122
## 26        2 1-1.5l 30-35     419     49        863       122
## 27        2 1.5-2l 30-35     221     39        863       122
## 28        2    >2l 30-35      72     12        863       122
## 29        2    <1l   >35     931     87       4806       573
## 30        2 1-1.5l   >35    2443    290       4806       573
## 31        2 1.5-2l   >35    1110    143       4806       573
## 32        2    >2l   >35     322     53       4806       573
## 33        3    <1l   <25      35      5        119        26
## 34        3 1-1.5l   <25      53     10        119        26
## 35        3 1.5-2l   <25      24      8        119        26
## 36        3    >2l   <25       7      3        119        26
## 37        3    <1l 25-29      73     11        335        56
## 38        3 1-1.5l 25-29     155     24        335        56
## 39        3 1.5-2l 25-29      78     19        335        56
## 40        3    >2l 25-29      29      2        335        56
## 41        3    <1l 30-35      89     10        493        79
## 42        3 1-1.5l 30-35     240     37        493        79
## 43        3 1.5-2l 30-35     121     24        493        79
## 44        3    >2l 30-35      43      8        493        79
## 45        3    <1l   >35     648     67       3220       392
## 46        3 1-1.5l   >35    1635    187       3220       392
## 47        3 1.5-2l   >35     692    101       3220       392
## 48        3    >2l   >35     245     37       3220       392
## 49        4    <1l   <25      20      2         72        14
## 50        4 1-1.5l   <25      31      7         72        14
## 51        4 1.5-2l   <25      18      5         72        14
## 52        4    >2l   <25       3      0         72        14
## 53        4    <1l 25-29      33      5        169        28
## 54        4 1-1.5l 25-29      81     10        169        28
## 55        4 1.5-2l 25-29      39      7        169        28
## 56        4    >2l 25-29      16      6        169        28
## 57        4    <1l 30-35      40      4        255        50
## 58        4 1-1.5l 30-35     122     22        255        50
## 59        4 1.5-2l 30-35      68     16        255        50
## 60        4    >2l 30-35      25      8        255        50
## 61        4    <1l   >35     316     36       1498       234
## 62        4 1-1.5l   >35     724    102       1498       234
## 63        4 1.5-2l   >35     344     63       1498       234
## 64        4    >2l   >35     114     33       1498       234

The function summarize is a function of the plyr package that computes its followed arguments, like totHolders=sum(Holders), totClaims=sum(Claims) in the above example, for each subgroup specified in .variables.

Of course, you can use functions other than summarize. The following example computes the mean absolute deviation for each district.

# a brute-force way is the following.
# note that the "ddply" first subsets the data and give the subset to a function as its argument.
# For example, a subset will look like this:
subset(Insurance, District == 1)
##    District  Group   Age Holders Claims
## 1         1    <1l   <25     197     38
## 2         1    <1l 25-29     264     35
## 3         1    <1l 30-35     246     20
## 4         1    <1l   >35    1680    156
## 5         1 1-1.5l   <25     284     63
## 6         1 1-1.5l 25-29     536     84
## 7         1 1-1.5l 30-35     696     89
## 8         1 1-1.5l   >35    3582    400
## 9         1 1.5-2l   <25     133     19
## 10        1 1.5-2l 25-29     286     52
## 11        1 1.5-2l 30-35     355     74
## 12        1 1.5-2l   >35    1640    233
## 13        1    >2l   <25      24      4
## 14        1    >2l 25-29      71     18
## 15        1    >2l 30-35      99     19
## 16        1    >2l   >35     452     77
# and then, this subset becomes the "x" in the following function that we will use:

holderMAD = mean(abs(x$Holders - mean(x$Holders)))
}

# look at the result of applying the "ddply" function:
ddply(.data = Insurance, .variables = ~District, .fun = computeMAD)
##   District       V1
## 1        1 620.2188
## 2        2 404.9688
## 3        3 274.2109
## 4        4 126.2656
# if you don't like the "V1" label, we can modify the above code as follows:

holderMAD = mean(abs(x$Holders - mean(x$Holders)))
}
ddply(.data = Insurance, .variables = ~District, .fun = computeMAD)
##   District MADofHolders
## 1        1     620.2188
## 2        2     404.9688
## 3        3     274.2109
## 4        4     126.2656

The following code seems to return the same result, but we will see the difference later.

# The same result can be obtained by using "summarize" and a more general function:
}

ddply(.data = Insurance, .variables = ~District, .fun = summarize, MADofHolders=computeMAD(Holders))
##   District MADofHolders
## 1        1     620.2188
## 2        2     404.9688
## 3        3     274.2109
## 4        4     126.2656

data.table allows to do the same thing by a simple syntax. The rule is the following:

nameOftheDataTable[,functionToUse,by=groupingVariable]

For example, the following two codes provide the same result.

ddply(.data = Insurance, .variables = .(District), .fun = summarize, meanHolders=mean(Holders), meanClaims=mean(Claims))
##   District meanHolders meanClaims
## 1        1    659.0625    86.3125
## 2        2    415.8125    55.6875
## 3        3    260.4375    34.5625
## 4        4    124.6250    20.3750
dtInsurance[,.(meanHolders=mean(Holders), meanClaims=mean(Claims)), by=District]
##    District meanHolders meanClaims
## 1:        1    659.0625    86.3125
## 2:        2    415.8125    55.6875
## 3:        3    260.4375    34.5625
## 4:        4    124.6250    20.3750

transforming data

Sometimes we want to transform the data.frame into the matrix or the array form. This is done by daply, where the prefix da means transforming data.frame to array. In fact, the name ddply means transformation from data.frame to another data.frame.

Below is an example of transforming a data.frame into a matrix.

computeTotHolders = function(x) {