Mostrando entradas con la etiqueta R. Mostrar todas las entradas
Mostrando entradas con la etiqueta R. Mostrar todas las entradas

lunes, 1 de mayo de 2017

Aventura de Regresión Lineal y el método de eliminación hacia atrás paso a paso

En esta entrada, haremos el ejercicio de regresión lineal planteado en el semestre 2012-1, cuyo trabajo práctico es igual al de este semestre (2017-1). Se hace uso del R y mi librería estUNA para construir modelos de regresión lineal mediante el método de eliminación hacia atrás. En este proceso, enfatizamos la importancia de realizar un análisis de residuos, entre otras cosas para sugerirnos posibles transformaciones de las variables con las que podamos mejorar los modelos de regresión.

jueves, 6 de abril de 2017

estUNA

¿Qué es estUNA?


estUNA es una librería en R elaborada por mí para ampliar algunas funcionalidades del lenguaje R y facilitar la elaboración de los trabajos de estadística y el estudio independiente y a distancia de las materias de probabilidades y estadística de la Universidad Nacional Abierta en general. La librería contiene funciones para agrupar datos, generar resúmenes de datos agrupados y no agrupados, varios tipos de gráficas (histogramas, tortas, boxplots), contrastes de bondad de ajuste y de independencia chi-cuadrado, análisis de regresión, entre otras. La librería se actualiza cada semestre para incluir la data del trabajo práctico.

Eventualmente será publicada en el repositorio CRAN como un paquete. Actualmente, el archivo imagen (que permite trabajar con la librería) está disponible para su descarga en https://raw.githubusercontent.com/unamatematicaseltigre/estUNA/master/estUNA.

Introducción al R

es un entorno de programación

En esta página se hará una brevísima introducción al lenguaje R como entorno de programación. Sin pretender que esto sea una guía completa, se exponen los conceptos necesarios para poder utilizar este lenguaje como complemento instruccional a los cursos de estadística y probabilidades de la Universidad Nacional Abierta.

Un entorno de programación es una aplicación que permite crear, ejecutar y depurar programas. Los programas son esencialmente secuencias de instrucciones que le indican al computador de manera muy precisa lo que este debe hacer. Estas instrucciones se especifican en algo llamado lenguaje de programación y cada lenguaje de programación tiene su "gramática" particular y sus reglas de sintaxis. R es un lenguaje de programación interpretado, lo cual quiere decir que el programador ingresa instrucciones a través de una consola y el interprete de R va procesando cada instrucción a medida que esta se ingresa y va dando la salida respectiva a cada instrucción de forma secuencial.

En vez de escribir las instrucciones una por una en la consola, podemos indicar la secuencia de instrucciones que queremos ejecutar a través de un archivo de texto (cómo los que creamos cuando usamos el bloc de notas). Esto es lo que se conoce como un script. Un script es una especie de programa que necesita siempre de un interprete para poderse ejecutar. En esta guía, aprenderemos a crear nuestros propios scripts.

viernes, 23 de diciembre de 2016

Feliz Navidad 2016 (en R)

Quisiera desearle a mis lectores una feliz Navidad 2016 ... al estilo R. Vacílense mi arbolito navideño 3d creado usando el paquete plot3D de R:



Aunque esto sería aún otra decoración navideña hecha en R, lo original de esta es que el árbol se muestra en perspectiva tridimensional. Hace un tiempo, yo mismo elaboré un código en R para visualizar un árbol navideño en dos dimensiones (ver esta entrada), el cual a su vez era basado en el código que apareció en el blog Wiekvoet. Sin embargo, esta vez el reto era crear un modelo tridimensional del árbol visualizado en perspectiva con una animación mostrando el árbol en rotación.

jueves, 22 de diciembre de 2016

Merry Christmas 2016 (with R)

I'd like to wish all my readers a Merry Christmas 2016- R style! Behold my 3d Christmas tree created using the plot3D R package:



While this might seem like yet another Christmas decoration done in R, it is unique in that the tree is rendered in 3d perspective. I myself wrote some code for a 2d Christmas tree a while ago (see this entry in Spanish), in turn based on the code in the Wiekvoet blog. This time around though, I took it a bit further- the challenge now being to create a 3d model of a Christmas tree that I could rotate and animate in perspective. Heck, if others with enough leisure/boredom time to spare were to join in and showcase their own Christmas R creations for this time of the year, we might be witnessing the start of a new Christmas tradition for all R geekdom across the planet.

Mathematical model of a Christmas tree

Long before the birth of Jesus, most pagan cultures in Europe were sun worshipers and in December they celebrated the rebirth of the sun and the rich harvests. This is the significance of the Christmas tree in these festivities of fire and light, abounding in evergreens and holly branches. In Spanish America, instead of Christmas trees, people would place Nativity scenes (Nacimientos) in their homes, but with globalization and the prevalence of the Anglo-Saxon culture, the Christmas tree eventually found its way into our homes. At any rate, I for one have always found the lights and the bright decorations of the Christmas tree captivating and reminiscent of happier moments when friends are remembered and grudges are forgotten. So in a sort of weird tribute to the spirit of these festivities, I decided I would make a mathematical model of the Christmas tree. How geeky is that?

Mathematically, a Christmas tree is nothing but a fractal. A fractal consisting of a stub or segment of a tree trunk with two or three branches that separate outward from the trunk, and an extension, or outward projection of the trunk in the same direction as the trunk. The branches can in turn be considered as trunk segments, each comprising more branches and an extension outward in the same direction. The extension or outward projection of a trunk is in turn another stub with more branching off points and so on. Consequently, the tree construction process is recursive. Being a descendant of LISP, base R supports the list as the data structure of choice for implementing this recursive definition of a tree.

There are, to be sure, some considerations to be made so that this recursively-built tree ends up looking like a pine tree and not a palm tree, a cherry blossom tree, a mango tree or a bonzai tree. For one thing, pine trees are characteristically conical. This means that the angle of separation of the branches from the trunk is initially larger but "closes up" as the pine tree extends out or up. To control the degree of "outwardness" or "upness" associated to a particular tree-trunk stub there will be a depth parameter. Initially, the pine tree starts as one little green trunk stub of depth 1. As this stub branches off and extends upward, each of the children stubs has a lower depth than the parent stub. The most outward branches and leafs of the tree have a depth of almost 0. This will be useful for the dressing the tree with lights and decorations, as these lights are usually found on the more superficial or outward parts of the tree.

Another aspect of our tree growth model worth mentioning is how the tree trunk becomes thicker as the tree grows out an up. This is done via a width parameter that will eventually determine how thick the lines representing the trunk and the branches should be drawn. This width parameter also determines when the tree stem changes color from dark green to brown. In each growth cycle, the growth algorithm increases the width of each stub by a factor of 1.4, it then goes through the branches and the extension looking for the ending stubs (those whose branches and extensions are NULL or nonexistent). It then adds the branches and the extension and backtracks down to find the remaining end stubs. The width of the new branches and extensions is initialized to 1, but as can be seen, the width of the older stubs has been increased more.

The Christmas tree data structure is a list whose components are:

  • The starting point of the trunk segment, given as a vector with coordinates in \(\mathbb{R}^3\) as \((x_0,y_0,z_0)\).
  • The ending point of the trunk segment, as given by the vector with coordinates \((x_1,y_1,z_1)\). Together, the start and end points determine the direction vector of the tree stem as \(\vec{u}=(x_1-x_0,y_1-y_0,z_1-z_0)\). This direction vector will be useful for creating the extension stub, since the extension stub grows in the same direction as the parent stub. It is also used for determining where along the stub the branches start off and in what direction those branches are created.
  • The width parameter lwd which is also the thickness with which the tree stem is drawn as a segment when plotted.
  • The depth, which indicates how outward a branch or tree stem is, as already explained above.
  • Slots for three branches and one extension (branch1, branch2, branch3 and extension), which are nothing but lists recursively defined like this one. When a branch or extension is created, these slots are initialized to NULL.
Creating the extension of a tree stem poses no major problem. One simply takes as starting point of the extension the ending point of the parent tree stem. The ending point of the extension is determined by adding to its starting point a multiple of the direction vector \(\vec{u}=(x_1-x_0,y_1-y_0,z_1-z_0)\). This scalar multiple is such that the resulting length of the extension stem is slightly shorter than that of the parent stem. The most difficult issue in creating the Christmas tree lies with the creation of the branches. How can we create the branches so that they branch out from the stem in apparently uniform angles around the stem and not have all the branches branch out from the same side of the stem? How can we obtain the direction vectors for those branches so that they extend outwards from the stem? A little trigonometry and vector geometry is of use to us here.

If the \(\vec{u}\) vector is the direction of the tree stem being considered, then what we need is a vector perpendicular to \(\vec{u}\) - let's call it \(\vec{v}\) - so that we can obtain the direction vector of the branch \(\vec{b}\) by adding \(\vec{u}+\vec{v}\) and then multiplying \(\vec{b}\) by a scalar to set its length to the desired magnitude (see Fig. 1). However, considering we're dealing with the \(\mathbb{R}^3\) metric space, there are infinitely many such vectors. In fact, there is an entire two-dimensional space - a plane \(\mathcal{V}\) comprised of vectors that are all perpendicular to $\vec{u}$ (see Fig. 2). So what we need to obtain $\vec{v}$ is a orthonormal base of two vectors \(\{\vec{v_1},\vec{v_2}\}\) that will permit us to express \(\vec{v}\) as a linear combination of the two base vectors of \(\mathcal{V}\).

Fig. 1 Geometrical relationship between the main tree stem (\(\vec{u}\)) and one of its branches (\(\vec{b}\)). \(\theta\) is the angle at which \(\vec{b}\) branches outward from the main stem.

This is easy enough. Two vectors are said to be orthogonal if their dot-product is zero. So if we have a vector, say \((x,y,z)\) all we need to do is pick any non-zero coordinate and switch it with one of the remaining coordinates, changing one of the signs to negative and setting the remaining third coordinate to zero. So for example \((-z,0,x)\), \((0,z,-y)\), and \((-y,x,0)\) are all perpendicular to \((x,y,z)\), provided they are all non-null vectors. In short, to find an orthonormal base for the plane perpendicular to a vector \(\vec{u}\), pick any non-zero component of this vector, pick a second component to interchange it with while changing the sign of any of the two and set the third component to zero. This will give you a perpendicular vector to \(\vec{u}\). To find the other perpendicular vector, just repeat the above process, but interchanging the chosen non-zero component of \(\vec{u}\) with the component you had set to zero for the first vector. Finally, once you have your two vectors \(\vec{v_1}\) and \(\vec{v_2}\), you must multiply them by the inverse of their norms to set their norms to one. The procedure is better illustrated in the R code on this post.

Fig. 2 Finding vector \(\vec{v}\) as a linear combination of an orthonormal base of perpendicular plane \(\mathcal{V}\) and then using it to find the direction vector \(\vec{b}\) of the branch.

So once you have the base of the perpendicular plane to a given tree stem \(\vec{u}\), all you need is two scalar components \(c_1\) and \(c_2\) to define the perpendicular vector $v$ used for constructing the branch, as a linear combination of the base. How are we to go about this? Well, let's think about what we really need. For some parts of the tree - the deeper parts - we need three branches to a stem. For the more outward parts, two branches to a stem will suffice. We want these branches to be evenly distributed about the stem: in the case of 3 branches, we would like the angle of separation between the branches to be \(120^\circ\) (like the Mercedes-Benz star symbol), in the case of 2 branches, an angle of separation between them of about \(180^\circ\pm 20^\circ\) should be adequate.

What this means is that we want to choose the \(\vec{v}\) vectors in \(\mathcal{V}\) to have the sort of layout shown in Fig. 3. This is to ensure the pine tree will have a nice conical shape and not have its branches all projecting to the same side.

Fig. 3a 3 branch defining vectors Fig. 3b 2 branch defining vectors

Fig. 3 Distribution of two and three branch generating vectors on \(\mathcal{V}\)


Our problem now is to generate 2 or 3 vectors on a plane with such angles of separations. The good news is that we can define the vectors on the regular 2d plane with the canonical base \(\{\hat{i},\hat{j}\}\), and, because linear algebra is so cool, we can use the components of those vectors as \(c_1\) and \(c_2\) - the scalar coeficients used to define the \(\vec{v}\) vectors as linear combinations of \(\vec{v_1}\) and \(\vec{v_2}\). This is so because \(\{\vec{v_1},\vec{v_2}\}\) is an orthonormal base of a plane in the 3d space. So for example if the angle between \((0,1)\) and \((-\tfrac{1}{2},\tfrac{\sqrt{3}}{2})\) is \(120^\circ\), then the angle between \(\vec{v_2}\) and \(-\tfrac{1}{2}\vec{v_1} + \tfrac{\sqrt{3}}{2}\vec{v_2}\) will also be \(120^\circ\). Not only that, but the two vectors will lie on the \(\mathcal{V}\) plane in three dimensional space. Thus, we can translate the 2 dimensional vectors into vectors located on the \(\mathcal{V}\) plane perpendicular to the tree stem \(\vec{u}\).

Choosing the 2 or 3 vectors on the regular 2d plane later translating them as vectors on the $\mathcal{V}$ plane is a question of doing the following:

  • 2 branches: We first consider \((1,0)\) as the first vector and we determine the second unit-length vector by choosing a random angle between \(160^\circ\) and \(200^\circ\). We choose another random angle between \(0^\circ\) and \(360^\circ\) to rotate the entire set of two vectors.
  • 3 branches: Our three vectors will initially be \((1,0)\), \((\tfrac{\sqrt{3}}{2},-\tfrac{1}{2})\), and \((-\tfrac{\sqrt{3}}{2},-\tfrac{1}{2})\). We then choose a random angle between \(0^\circ\) and \(360^\circ\) to rotate the entire set of three vectors.
The reason we subsequently rotate the two or three vector set is to randomize the orientation of the branches so that our trees branches will project in all sorts of different directions. In either case, to rotate a 2d vector by an angle of \(\theta\) about the origin, we use the following linear transformation:

\[ (x,y)\quad\mapsto (x\,cos(\theta)-y\,sin(\theta),x\,sin(\theta)+y\,cos(\theta)) \]
There are a few remaining details to explain before we go into the R code. As we have already mentioned, the more superficial stems of the three have two branches whereas the deeper parts of the tree (those closest to the original thick stub from which the entire Christmas tree branches out) have three branches. This is controlled by the depth parameter. Whenever \(depth>0.8\), the tree stem will branch out into three branches, otherwise it will have only two. The depth parameter also controls where along the stem the branches are chosen. Stems of greater depth tend to branch off closer to its ending point \((x_1,y_1,z_1)\) whereas the more superficial stems branch off at points closer to its starting point \((x_0,y_0,z_0)\). Needless to say, the branching off points are chosen randomly. Finally, the placement of the lights and the tree decorations is also controlled by the depth parameter. The more superficial parts of the tree have a higher likelihood of containing lights and decorations. Again, the placement of these decorations is done stochastically.

R script for the 3d Christmas tree

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
library(plot3D)

trunk <- list(
  x0=0, y0=0, z0=0,
  x1=0, y1=0, z1=6,
  extension=NULL,
  branch1=NULL, branch2=NULL,
  branch3=NULL, branch4=NULL,
  depth=1,
  lwd=1,
  stub_color="brown"
)

draw_tree <- function(tree) {
  #function to draw the tree recursively
  if (is.null(tree)) return()
  with(tree,{
    segments3D(x0,y0,z0,x1,y1,z1,col=stub_color,
      lwd=lwd,add=TRUE)
    draw_tree(branch1)
    draw_tree(branch2)
    draw_tree(branch3)
    draw_tree(extension)
  })
}

extend <- function(tree) {
  #creates an extension of the main branch
  #in the same direction vector u=p1-p0
  with(tree,{
    growth_factor <- runif(1,min=0.9,max=1)
    return(list(x0=x1,y0=y1,z0=z1,
         x1=x1+growth_factor*(x1-x0),
         y1=y1+growth_factor*(y1-y0),
         z1=z1+growth_factor*(z1-z0),
         extension=NULL,
         branch1=NULL, branch2= NULL,
         branch3=NULL, branch4= NULL,
         depth=depth*0.9,
         lwd=1,
         stub_color="darkgreen"))
  })
}

create_branch <- function(tree,c1,c2) {
  #this function returns a branch for tree
  #c1 and c2 are the components for the vector
  #base {v1,v2}. v1 and v2 are two orthonrmal
  #vectors, both perpedicular to the direction
  #vector u=p1-p0 (p1 and p0 are the endpoints
  #of the tree trunk).
  #where_at is a scalar value in (0,1), indicating
  #where along p0=(x0,y0,z0) and p1=(x1,y1,z1)
  #the branch will grow.
  with(tree,{
    #vector u=(x1-x0,y1-y0,z1-z0) gives the 
    #direction of the tree trunk.
    #stub_length is the length of that trunk.
    u <- c(x1-x0,y1-y0,z1-z0)
    stub_length <- sqrt(sum(u*u))
    #growth_factor is how long the branch
    #will be with respect to stub_length
    growth_factor <- runif(1,min=0.7,max=0.8)
    where_at <- runif(1,min=0.4,max=0.8)
    #create two perpendicular vectors to u
    #and set their norms to 1
    i <- which(u!=0)[1]
    oi <- setdiff(1:3,i)
    v1 <- rep(0,3); v1[i] <- -u[oi[1]]; v1[oi[1]] <- u[i]
    v2 <- rep(0,3); v2[i] <- -u[oi[2]]; v2[oi[2]] <- u[i]
    v1 <- v1/sqrt(sum(v1*v1)); v2 <- v2/sqrt(sum(v2*v2))
    #vector v is a linear combination of v1 and v2,
    #hence it lies on the plane perpendicular to u
    v <- v1*c1+v2*c2
    #beta is the angle of separation between the
    #branch and the trunk. The angle is larger
    #for the "deeper" parts of the tree.
    beta <- runif(1,min=0.7*depth,max=0.9*depth)    
    #calculate the scalar k for multiplying with v
    #so that u+kv, the branch, will form a beta
    #angle with the trunk.
    k <- tan(beta)*sqrt(sum(u*u))/sqrt(sum(v*v))
    #new_u is the direction vector of the branch,
    #whose length will be the length of the branch 
    new_u <- u+k*v
    new_u <- new_u / sqrt(sum(new_u*new_u))*
             stub_length*growth_factor
    #the new (x0,y0,z0) point of the branch
    #is where along the main trunk the branch begins.
    new_x0 <- x0+(x1-x0)*where_at
    new_y0 <- y0+(y1-y0)*where_at
    new_z0 <- z0+(z1-z0)*where_at
    #the new (x1,y1,z1) point of the branch is
    #its ending point.
    new_x1 <- new_x0+new_u[1]
    new_y1 <- new_y0+new_u[2]
    new_z1 <- new_z0+new_u[3]
    list(
      x0=new_x0,y0=new_y0,z0=new_z0,
      x1=new_x1,y1=new_y1,z1=new_z1,
      extension=NULL,
      branch1=NULL, branch2= NULL,
      branch3=NULL, branch4= NULL,
      depth=depth*0.8,
      lwd=1,
      stub_color="darkgreen")    
  })
}

grow_tree <- function(tree) {
  if (is.null(tree) ) return(NULL)
  tree$lwd <- tree$lwd*1.4
  if (tree$lwd>2.0) tree$stub_color <- 'brown4'
  if (is.null(tree$extension)) {
    tree$extension <- extend(tree)
    if (tree$depth<0.5) {
      #make two branches at an angle of more than 160 degrees
      rot1 <- runif(1,min=0,max=2*pi)
      rot2 <- (runif(1,min=8*pi/9,max=10*pi/9)+rot1)%%(2*pi)
      tree$branch1 <- create_branch(tree,cos(rot1),sin(rot1))
      tree$branch2 <- create_branch(tree,cos(rot2),sin(rot2))
    } else { 
      #make 3 branches with 120 degree angles between them and
      #rotate the entire branch set.
      rot <- runif(1,min=0,max=2*pi)
      tree$branch1 <- create_branch(tree,-sin(rot),cos(rot))
      tree$branch2 <- create_branch(tree,
                        sqrt(3)/2*cos(rot)+sin(rot)/2,
                        sqrt(3)/2*sin(rot)-cos(rot)/2)
      tree$branch3 <- create_branch(tree,
                        -sqrt(3)/2*cos(rot)+sin(rot)/2,
                        -sqrt(3)/2*sin(rot)-cos(rot)/2)
    }
  } else {
    tree$extension <- grow_tree(tree$extension)
    tree$branch1 <- grow_tree(tree$branch1)
    tree$branch2 <- grow_tree(tree$branch2)
    if (tree$depth>=0.5) 
      tree$branch3 <- grow_tree(tree$branch3)
  }
  return(tree)
}

create_ornaments <- function(tree) {
  if (is.null(tree)) return()
  po <- (1-tree$depth)^6
  ornament <- sample(c(T,F),size=1,prob=c(po,1-po))
  co <- sample(c("red","darkgoldenrod4"),size=1,
               prob=c(0.6,0.4))
  if (ornament)
    ornaments <<- rbind(ornaments,
      data.frame(x=tree$x1,y=tree$y1,z=tree$z1,color=co))
  create_ornaments(tree$branch1)
  create_ornaments(tree$branch2)
  create_ornaments(tree$branch3)
  create_ornaments(tree$extension)
}

create_lights1 <- function(tree) {
  if (is.null(tree)) return()
  po <- (1-tree$depth)^4
  light <- sample(c(T,F),size=1,prob=c(po,1-po))
  if (light)
    lights1 <<- rbind(lights1,
      data.frame(x=tree$x1,y=tree$y1,z=tree$z1))
  create_lights1(tree$branch1)
  create_lights1(tree$branch2)
  create_lights1(tree$branch3)
  create_lights1(tree$extension)
}

create_lights2 <- function(tree) {
  if (is.null(tree)) return()
  po <- (1-tree$depth)^4
  light <- sample(c(T,F),size=1,prob=c(po,1-po))
  if (light)
    lights2 <<- rbind(lights2,
      data.frame(x=(tree$x1+tree$x0)/2,
                 y=(tree$y1+tree$y0)/2,
                 z=(tree$z1+tree$z0)/2))
  create_lights2(tree$branch1)
  create_lights2(tree$branch2)
  create_lights2(tree$branch3)
  create_lights2(tree$extension)
}

draw_ornaments <- function() {
  with(ornaments,
    {points3D(x=x,y=y,z=z,
              pch=19,cex=1.2,col=as.character(color),
              colkey=FALSE,add=TRUE)})
}

draw_lights1 <- function() {
  with(lights1,
    {points3D(x=x,y=y,z=z,pch="+",cex=0.8,col="white",
              colkey=FALSE,add=TRUE)})
}

draw_lights2 <- function() {
  with(lights2,
    {points3D(x=x,y=y,z=z,pch="+",cex=0.8,col="yellow",
              colkey=FALSE,add=TRUE)})
}

set.seed(20161224)
pine_tree <- trunk
for (i in 1:5) pine_tree <- grow_tree(pine_tree)
ornaments <- NULL;
lights1 <- NULL;
lights2 <- NULL;
create_ornaments(pine_tree)
create_lights1(pine_tree)
create_lights2(pine_tree)

png("tree%02d.png")
for (i in 0:35) {
    perspbox(x=c(-25,25),y=c(-25,25),z=c(0,40),bty="n",
             phi=8,theta=i*10,col="white",alpha=0)
    draw_tree(pine_tree)
    draw_ornaments()
    switch((i%%4)+1,{},
     {draw_lights1()},
     {draw_lights2()},
     {draw_lights1(); draw_lights2()})
}
graphics.off()


Bibliography



Dear Reader:

If you found this post interesting or useful, please share it on Google+, Facebook or Twitter so that others may find it too.

Below you will find a link to download a printable pdf version of this article for off-line reading and the R source code featured in this post, but essentially it's the same information I'm sharing with you in this blog entry. It will take you to a Bitcoin payment gateway which is very secure, anonymous and hassle free (I will not collect your sensitive personal information). The fee is small - 0.0005 BTC - or roughly $0.45 at today's rate. Think of this as a donation, not a business transaction that will help me continue writing content and survive in this hell-hole that Venezuela has become. Thanks in advance!

jueves, 1 de diciembre de 2016

How to send bulk email to your students using R

In this post I write about a technique I use to send personalized bulk emails to my students at the open and distance education university where I teach (the Universidad Nacional Abierta in Venezuela, or UNA, for short). Being able to mass-email students with personalized messages is should be an important issue in any distance education setup. However, as the UNA does not provide teachers with institutional tools to perform such an outreach, we teachers are left to our own devices. I use R for all sorts of teaching and research tasks, and this was not going to be the exception. That's how I came across an R package for sending emails called mailR, which proved to be the right tool for this task.

In what follows I will give some information of context to explain why mass-emailing would be a necessary strategy for us course facilitators to implement at the UNA. If you are only interested in how to implement this using R, you can skip directly to the R implementation section.

A brief outline of the educational process at the UNA

Perhaps many of you are not aware that the UNA was one of the first open and distance education institutions of higher learning in the world when it was founded in 1977, following Great Britain's Open University, created in 19701. Being very much based on the Open University's instructional and organizational model at the time, it is still pretty much a product of the principles and the approach to distance education prevailing back then. It was designed to bring education to the masses in remote or rural areas of the country and to target working adults who could not comply with the time and place constraints imposed by traditional, presential universities. The mass-produced paper book was the principal medium of instruction delivery. Although other media such as vcr tapes, radio/tv broadcasts were at some point produced or contemplated, their use never really caught on.

The production of these books, and indeed the entire educational process of the UNA, is a typical industrial-age process entailing a division of labor among the academic personel of the institution. On the one hand, you have the content specialists, evaluators and validators at the central level of the institution who are directly involved in the book and course production processes. This personnel is also responsible for creating the exams and oher instruments of assessment each semester. Additionally, they are responsible for scheduling the exam dates of all courses each semester. Content specialists, evaluators and validators do not have contact with students; that is the role of course facilitators, such as myself, who operate at the periphery of the institution in the various regional locations across the length and breadth of Venezuela.

The primary task of the course facilitators is to serve as a bridge between students and the contents of the course embodied in the course textbooks. This means that if, in the process of self-study, the student has issues understanding some contents of the course, he or she contacts the facilitator for individual assistance. Ocassionally, facilitators also hold workshops where groups of students attend. However, attendance to these workshops, ressembling a more traditional learning environment in which a teacher addresses the class to expound some subject, is optional. Grading exams is another task facilitators do, although actually administering the exams is something done by another type of personel - the proctors - who are not necessarily academics.

As can be seen, the distance learning model of the UNA involves a separation of teaching tasks and it is centered around the concept of self-learning. Facilitator-student interaction is predominantly presential. I don't have any statistics to back this up, but I do believe that for the majority of the UNA facilitators, phone interaction with students occurs very rarely and internet interaction (email, forum or video-conferencing) is virtually (no pun intended) inexistent. In this, regard, my presonal case is an outlier. For example, for the july-september trimester, of the total 1643 facilitator-student interaction events, 1605 of those were online, in the form of grade consultations in my online system, messages in my blog's chatbox, email consultations and of course, the bulk emailing system I'll be presenting in this entry.



On the necessity of creating a better communicational platform

As it was made evident in the previous section, the UNA's model of distance education is outdated and needs to be urgently revamped. Unfortunately, I don't see this happening soon from within the institution itself, which maintains a very conservative structure and hierarchy. I've assumed this blog to be my own personal experiment as to how a facilitator like me could create an alternative platform for teacher-student interaction using free or easily available technology. This new structure has to take into account the "natural" communication channels.

By far, Facebook is the most frequently used social network by the student population in Venezuela (and the general population of this country, for that matter). However, as an educational tool, Facebook has its disadvantages: it is a very noisy and distracting communication channel more oriented towards aimless chatter and gossip2. Furthermore, although Facebook is immensely popular, there is a more universal communication channel in the digital era: email.

In 2013, there were approximately 1.19 billion Facebook users and 0.2 billion Twitter users in the world versus 2.5 billion email users3. In spite of seeming less "fashionable" than the popular social media today, email is, and will certainly continue to be, the most widely used digital communication tool, if only because to register for any of these social media, you need to provide an email address! Truisms aside, email does present some important advantages over social media in the educational context: one can be more certain that a student will read a message if we send it by email than if we post it on Facebook, Twitter, Google+, etc. And besides being a less noisy communication medium, email is more personal. Therefore, in the distance education context, email is arguably the most natural communication channel between facilitator and students.

In distance education, distance is both a strength and a weakness of this instructional modality. Distance is an essential attribute of open and flexible education in which the learner's autonomy, time and obligations to family and work are respected. However, too much distance alienates the student from the learning community, which in my opinion accounts for a large portion of dropouts in MOOCs and other distance learning contexts. This is where email comes in. Addressing each student by name and making the student feel that the conversation is about him or her personally works wonders in these contexts, where communication is usually more generical and inclined towards anonymity. Email enables the instructor to create a more personal rapport with the students and making them feel that they are being taken into account personally.

This is the reason why email is the tool of choice for telemarketing and this is why media experts like Michael Hyatt, Anna Hoffman, Neil Patel or Jon Morrow insist that the most important asset of a blog is its list of email subscribers. Surely, telemarketing brings to mind ideas about sales, profits and customers, whereas education is seen as an entirely different phenomenon. At the end of the day, however, we educators are trying to exert influence over our students and if we consider education to be a service to others, then it wouldn't be a bad idea to consider our students as customers. Exerting influence, leadership and using digital communication platforms- these are some of the topics Michael Hyatt discusses in his blog4.

Why do I employ R with the mailR package for bulk emailing instead of other tools?

There are, to be sure, services like MailChimp which allow you to run bulk emailing campaigns. Although MailChimp has a free service that allows you to send up to 12000 emails a month to over 2000 subscribers5 and it seems very easy to set up and get running, for me, it has one major disadvantage: due to DMARC policies, MailChimp cannot send bulk emails from adresses associated to free-service provider domains, like Gmail, Yahoo and Outlook. Surely, academic faculty members of the UNA have our UNA-domain email address (mine is jromero@una.edu.ve). However, I rarely use it and for me, it's more convenient to use my gmail address (it's my natural communication channel). Furthermore, I'm not sure about the bulk-emailing restrictions on my una-domain email address, although in truth, I haven't explored the matter throughly. As for purchasing a hosting/domain+email addresses package adequate enough in terms of server response times and maximum concurrent users online, while I'm sure this is very inexpensive in the rest of the world, it is almost prohibitive for most Venezuelans, whose socialist government has imposed an extremely restricted currency exchange mechanism6.


There is another disadvantage to using ready-made tools like MailChimp, in my case. While addressing recipients by their name is something easily done with MailChimp, I'm not sure if you can easily configure for using male or female gender adjectives (in Spanish, adjectives change in form based on the gender of the thing or person described). Or for example, as a teacher I might be interested in saying something or other on my message based on certain conditions of each individual student (eg. I might advise failing or at risk of failing students to take such and such remedial action). These sort of things require a tool for programmatically tailoring the message to each student. They require using a general programming language like R. At least to me, it seems much easier this way, since I already use R for a whole bunch of other tasks, as this blog testifies to.

Now that I have (I hope) justified my use of R for bulk emailing tasks, I must point out that there are several packages for that. There is sendmailR, mailR and gmailr. The latter package looks interesting and on the author's GitHub site7, there is a tutorial of sorts on how to set it up for sending bulk emails from a Gmail account, coincidentally with an online course application example such as the one I'm discussing in this post. However, I had tried this package before and for some reason, could not get it to work. In deciding between sendmailR and mailR, I considered that mailR's send.mail function has an option for sending pure text or HTML mails, whereare the equivalent sendmailR function had text mail only hardcoded within (this is a bug that they apparently fixed now8). Being able to end HTML was important to me, as I wanted to be able to send nicely formatted emails like this one:

UNIVERSIDAD NACIONAL ABIERTA
CENTRO LOCAL ANZOATEGUI
UA EL TIGRE

Hola PEDRO:

Ante todo quiero desearte éxito en este semestre 2016-1 y particularmente en la primera prueba parcial de la asignatura XXX (código xxx) que presentarás este sábado. El propósito de este mensaje es presentarme: mi nombre es José Romero, soy egresado en la Licenciatura de Matemáticas de la UNA y actualmente, soy el asesor del área de matemáticas en la unidad de apoyo de El Tigre.

Estoy contactando por correo a todos los estudiantes de la asignatura XXX (xxx) de la UNA a nivel nacional para invitarlos a que visiten mi blog.

Huelga decir que en caso de cualquier duda, no dudes en consultarme por el buzón de mensajes del blog o a través de mi correo: jlaurentum@gmail.com. (NOTA: No respondas a este correo ya que es una cuenta para envios automatizados solamente). Estoy a tus ordenes,

Atentamente,


José Romero



Materials needed for this experiment

Besides an R installation with the mailR package, you will need a file with the data of your students: their email addresses, their names, and any other relevant information you wish to convey to them, such as grades, personal feedback for each student, etc. The file needs to be a csv file, which is essentially a text file in which each line is a row of the data table and the fields or columns in each line are separated by a special character such as a comma, a semicolon, or a tab9. If you have your data in a spreadsheet, you can easily convert this to a csv file by using "Save As" and then choosing the "Text/CSV" file type. Indicate the separation character- that will be the same character you indicate when you read in the csv file from R. Your csv file could contain something like this:

id lastname firstname gender c_code una_location email_address 12345678 PEREZ PEDRO M 126 02-01 pedroperezm@dontcare.com 87654321 PEREZ JOSEFINA F 126 02-01 jperez@noneofyourbizwaks.com : : : : : : : : : : : : : :
It is worth noting that with this R/mailR method, you can send up to 100 email messages a day from any of the free email domains such as Gmail, Yahoo or Outlook. If you go beyond this limit, your email account will be temporarily suspended for a day (24 hours), after which you can continue to send messages as usual (never exceeding the 100 emails a day maximum quota). In experimenting with this method, I discovered I had to use a time delay between each email. In the script below, you will see that I uniformly distributed random delays between 3 and 6 seconds. Maybe, by using bigger max/min values and a greater range (spread) for the random delays, you can ensure that the message sending process will proceed 100% smoothly, without having the script interrupted by this error:

Error in ls(envir = envir, all.names = private) : invalid 'envir' parameter
Should you get the above error, you simply have to modify the first and last indexes in the for-loop so that the message sending script will resume the batch sending process from the last student where you left off. The script is the following:

R script for batch emailing

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
#Batch email sending script
#The following function crafts the message for each student,
#represented by the x parameter
message_text <- function(x) {
  tmp <- paste(c(
    '<!DOCTYPE html>',
    '<html xml:lang="es" lang="es">',
    '<head>',
    '<meta http-equiv="content-type" content="text/html;charset=utf-8" />',
    '<meta name="author" content="José Loreto Romero Palma" />',
    '<STYLE type="text/css">',
    'h1 { text-align: center; font-family: Helvetica, FreeSans; font-size: 40px;',
    'font-variant: small-caps }',
    'h2 { text-align: center; font-family: Helvetica, FreeSans; font-size: 32px;}',
    'h3 { text-align: center; font-family: Helvetica, FreeSans; font-size: 28px;}',
    'p { text-align: justify; font-family: Helvetica, FreeSans; font-size: 16px;',
    'width: 640px}',
    'ul { text-align: justify; font-family: Helvetica, FreeSans; font-size: 16px}',
    'td { font-family : Helvetica, FreeSans; font-size: 12px}',
    '</STYLE>','</head>','<body>',
    paste(c(
    paste0("<table><tbody><tr><td width='60px'>",
    "<IMG SRC='https://lh3.googleusercontent.com/hgDApmnzAf2TrP9lzTUc3U9ZG9",
    "EBaUn9s9OM4DWD4BXtO6j51GCfgLyfyjTdHJ5G8CfXD6_XeipYaQ=w1366-h768-no' ",
    "NAME='logo_UNA' ALIGN='LEFT' WIDTH='51' HEIGHT='51' BORDER='0'>",
    "</td><td width='580px'>UNIVERSIDAD NACIONAL ABIERTA<br/>",
    "CENTRO LOCAL ANZOATEGUI<br/>UA EL TIGRE</td></tr></tbody></table>"),
    paste0("<br/>"),
    paste0("<p>Hola ",x$firstname," ",x$lastname,":</p>"),
    paste0("<p>",
    "Ante todo quiero desearte éxito en este semestre 2016-1 y ",
    "particularmente en la primera prueba parcial de la asignatura ",
    "XXX (código xxx) que presentarás ",
    "este sábado.  El propósito de este mensaje es ",
    "presentarme: mi nombre es <b>José Romero</b>, soy egresado en la ",
    "Licenciatura de Matemáticas de la UNA y actualmente, soy el ",
    "asesor del área de matemáticas en la unidad de apoyo de El Tigre.</p>"),
    paste0("<p>",
    "Estoy contactando por correo a todos los estudiantes de la asignatura ",
    "XXX (xxx) de la UNA a nivel nacional para invitarlos a que ",
    "visiten mi <a href='http://unamatematicaseltigre.blogspot.com'",
    " target='_blank'>blog</a>.</p>",
    "<p>Huelga decir que en caso de cualquier duda, no dudes en ",
    "consultarme por el buzón de mensajes del blog o a través de mi ",
    "correo: <a href=\"mailto:jlaurentum@gmail.com\">jlaurentum@gmail.com</a>.",
    " (NOTA: No respondas a este correo ya que es una cuenta para envios ",
    "automatizados solamente). Estoy a tus ordenes,</p>"),"","",
    "<p>Atentamente,</p>","<br />","<p>José Romero</p>")),
    '</body>','</html>'),
    collapse="")
    return(tmp)
}

library(mailR)
#note: turn on/off less secure apps access for gmail (?)
#note: the daily quota is 100 emails 
#read in the mailing list in the csv
mail_list <- read.csv2("roster.csv",as.is=TRUE)
file_h <- file('test.html',open="w")
writeLines(message_text(mail_list[1,]),file_h)
close(file_h)
ext_f <- file("mail.out",open="w")
close(ext_f)
for (recipient in 1:nrow(mail_list)) {
  email <- message_text(mail_list[recipient,])
  send.mail(from="mygmail@gmail.com",
    to=as.character(lista_correos[recipient,]$email_address),
    subject="Invitation to the unamatematicaseltigre blog",
    body=email,
    html=TRUE,
    authenticate=TRUE,
    smtp = list(host.name = "smtp.gmail.com",
    user.name = "mygmail", passwd = "mypasswd", ssl = TRUE),
    encoding = "utf-8",send=TRUE)
  print(mail_list[recipient,])
  Sys.sleep(runif(n=1,min=3,max=6))
  #write each recipient to a file
  ext_f <- file("mail.out",open="a")
  writeLines(text=paste0("[",recipient,"] ",
    paste0(as.character(mail_list[recipient,]),collapse="\t")),
    sep="\n",con=ext_f)
  close(ext_f)  
}


Lines 4-52 define a function - message_text that builds the HTML message body for a given recipient. This script produces a message body such as the one shown in the example above. Notice the meta tags in the header defining the CSS to give format to your message and the ability to include images like the UNA logo in line 23-25.

Line 58 reads in the roster.csv file as a data frame placed in the mail_list variable. The read.csv2 function assumes your separation character is a semicolon (;) and the decimal point is the , (this is so in spanish speaking countries). However, you can configure other settings by using the read.table or read.csv functions. Lines 59-61 simply write a test message to an HTML file. I usually run the script up to these lines first to ensure that the message text comes out like I want it before starting the batch sending processs. The for loop in lines 64-83 is responsible for batch emailing to all the students in the roster data frame.

The index variable recipient of the for loop (line 64) is an integer going from 1 to the number of rows in the roster data frame. Bear in mind that the daily quota is 100 emails a day if you are emailing from a free email account like gmail or yahoo. Therefore, if your roster file consists of more than 100 students, you will want to split this up into groups of 100 students, batch sending for each group on different days. Besides, there may be problems while emailing to a particular address that cause the script to stop with an error (the Error in ls(envir = ... mentioned earlier). Therefore, you need some way of keeping track of the emails that you send.

This is the reason I create a file (lines 62-63) where I will be writing the information of each individual row in the roster as I send each message (lines 78-82). If the script execution stops for some reason (besides the error mentioned above, power blackouts are quite common in Venezuela), I can use this file to see from what row of the roster I should resume the batch sending. Besides, it's always a good idea to keep a record of all emails sent. While running this code, If you open the email client from a browser, you will see how the Sent email box starts to populate with messages as the script sends them. If for some reason the postmaster cannot send to a certain email address, you will get a notification email in your Inbox.

The actual email sending is done via the send.mail function of the mailR package in lines 66-74. In this example, I'm sending from a fake gmail account: mygmail@gmail.com. The user name before the ad sign of your gmail address is the one you will pass to the user.name parameter in line 73. In the same line of code, you also have to indicate the password you use to login to your email account as parameter to passwd. The host name for gmail addresses is smtp.gmail.com, but if you use yahoo or some other free email provider, you have to find what the smtp host name is for that provider and indicate it in the host.name parameter at line 72. A Google lookup should suffice.

As already mentioned, I introduce a random delay between each call to the send.mail function. This is so my gmail client won't suspect I'm sending emails in "automatic pilot mode" and my account won't be temporarily suspended. For me, random delays between 3 and 6 seconds worked fairly well, but you may want to experiment with higher delay values to be safe.

Finally, if you have any questions or comments, feel free to let me know in the comment section of this post. I'd be happy to answer them.

Notes

  1. See McIsaac and Gunawardena (2002) for an account on the history and theoretical constructs behind distance education.
  2. See What Is Google+? (An African Perspective), an interesting post by Rotimi Orimoloye for his blog Digital Africa. In it, he argues that Nigerians, who also use Facebook more often than any other social network, should transition into Google+, the latter being more suitable for getting to know who the experts in a particular field are and to engage in learning about these fields.
  3. See the statistics in the FAQ section of Specific Feeds: https://www.specificfeeds.com/page/faq-email-publishers.
  4. Michael Hyatt, author of a book titled "Platform: Get Noticed in a Noisy World", is a expert in this subject. I strongly reccomend visiting his blog https://michaelhyatt.com.
  5. While there are experts on the subject of platforms like Michael Hyatt and the others I have mentioned, I think that out of necessity, I'm on my way to becoming an expert myself on the subject of building a platform with free or freely available tools. I believe that while technology is in some cases widening the gap between the rich and the poor, free and open source technology holds enormous potential as empowering tools for people who, like myself, live in countries with failed economies.
  6. See Premraj, 2014.
  7. Hence the acronym CSV: Comma Separated Values. The separation character can be any character you choose. For this example, we will assume the semicolon (;) is the separation character.

Bibliography




If you found this post interesting or useful, please share it on Google+, Facebook or Twitter so that others may find it too.


domingo, 20 de noviembre de 2016

Telemercadeo por correo electrónico usando R

En esta entrada abordo el problema del telemercadeo en una institución de educación a distancia como la UNA. No, no me refiero a que tenemos un problema de docentes-vendedores o estudiantes-vendedores en nuestra comunidad universitaria, ni mucho menos pretendo insinuar que la educación en la UNA sea un negocio y las personas que hacemos vida en esta comunidad unos mercaderes. Pero si consideramos que el telemercadeo es una forma de contacto directo entre un asesor y sus clientes por medio de la telecomunicación, me concederan que las técnicas de telemercadeo son útiles en una institución como la UNA, donde nuestros clientes son los estudiantes. Específicamente, en esta entrada voy a exponer un método para el envio masivo de mensajes de correo eletrónico personalizados a muchos estudiantes. Sin duda, esto será de mucho interés para mis colegas asesores en la UNA y quizás, hasta para aquellos estudiantes que necesiten hacer telemercadeo como herramienta de negocios.

miércoles, 5 de octubre de 2016

R Puzzle No. 2 - Where's the zebra?

For the second R puzzle, we will tackle the famous Zebra Puzzle, sometimes known as Einstein's Puzzle or Einstein's Riddle. There are various implementations of solutions to the puzzle in different programming languages1, but here we will be using R and an R package called lpSolve, which is intended to solve linear programming problems. I had a lot of fun preparing this entry and learned quite a bit on how to implement linear programming models using R - I hope you will find this post entertaining, interesting and useful.

The Puzzle


The Zebra Eye, by Singapore artist Jarrell Goh aka fuzzyzebra (2007).
The Zebra puzzle, or Einstein's riddle as it is often called, was popularized in a 1962 number of Life International magazine, with a follow-up in a March 1963 issue of Life containing the solution. It is said to have been invented by a young Einstein, although it is also credited to Lewis Carroll. The truth is, there is no clear evidence of either's authorship of the riddle. I should also point out that various versions of the puzzle exist, with different nationalities, brands of cigarettes, or even replacing sports teams with cigarettes, etc.

The version of the puzzle we will be using here is as follows. There are five houses along a street whose five occupants are all of different nationalities (English, German, Norwegian, Swedish, and Danish). Each occupant also smokes a different brand of cigarette (five brands in all), has a different favorite drink, has a different pet and owns a house of a different color than his neighbors. The following clues are given:


  • The Englishman lives in the red house.
  • The Swede has a dog.
  • The Dane drinks tea.
  • The man in the green house drinks coffee.
  • The white house on the left of the green house.
  • The man in the yellow house smokes Dunhill.
  • The man who smokes Pall Mall has a bird.
  • The German smokes Prince.
  • The man who drinks beer smokes Bluemaster.
  • The man that smokes Blends lives next to the man with a cat.
  • The man with a horse lives next to the one that smokes Dunhill.
  • The man who smokes Blends lives next to the man that drinks water.
  • The Norwegian lives next to blue house.
  • The man in the middle house (third house) drinks milk.
  • The Norwegian lives in first house.
If you have never tried to solve this puzzle, you should. It'll give you some moments of good entertainment. I suggest you make a table with 5 columns and 5 rows in which the columns represent the five houses, numbered 1-5 from left to right and the rows represent the nationalities, cigarette brands, pets, favorite drinks and house colors. Start with the most direct clues and incrementally reach the solution by using one clue at a time. Once you reach the solution, reflect on how you did it. What algorithm did you use? How did you represent the data? And finally, how would you program a computer (using R specifically) to solve this puzzle? This is when the real fun begins...

A mathematical approach

Like I said, there are various code implementations in various programming languages that you can find on the Internet. On the rosetta.org site (listed in the References section), there's even an R solution using an R-package called "combinat", which I confess I'm not familiar with but I assume it's similar to the Python "constraints" module. In this post I'll take a different approach (using linear programming), but first, let's lay down the basic strategy for solving this puzzle computationally.

If you tried to solve the puzzle by hand using a 5x5 table like the one I suggested, you may have probably considered the blank table cells as the problem variables, to be filled by a specific drink, nationality or other characteristic. If we are to solve this puzzle by coding it, we should take another approach: we should consider instead the 25 characteristics (brands of cigarettes, nationalities, house colors and so on) as the variables, each variable assuming an integer value between 1 and 5 representing the number of the house where that specific item is located. Doing so brings our puzzle within the realm of mathematics.

Without any restrictions, we simply assign to each of the 25 variables a number between 1 and 5. There are \(5^{25}\approx 2.98\cdot 10^{17}\) possible ways of doing so. If we next assume that there are no two houses with occupants of the same nationality, pets, brands of cigarettes, etc., then we narrow the solution space down to \((5!)^5=24,883,200,000\) possible cases, which is still a considerable number of cases. Each of the puzzle's clues is really an additional constraint and together, these constraints reduce the possibilities to a single definitive solution.

So once we have defined our 25 variables identifying them according to the specific item they represent, such as Horse, Norwegian, Tea, Red, and so on, how do we mathematically represent the constraints? Consider the first clue for example: the Englishman lives in the red house. This can be represented mathematically as \(English=Red\). Clues like "the occupant of the middle house (third house) drinks milk" can be expressed as \(Milk=3\). If the white house is to the right of the green house, we could simply write this as \(White=Green + 1\). If the Norwegian lives in the first house, we represent this by \(Norwegian=1\). The "x is next to y"-type clues are a bit trickier though.

What does "x is next to y" or "x and y are neighbors" mean? It simply means that their house numbers differ by one. So, for example, "the Norwegian lives next to the blue house" could be expressed mathematically as \(|Norwegian-Blue|=1\), which in turn means that either \(Norwegian-Blue=1\) or \(Blue-Norwegian=1\). Finally, since we require all variables to be bounded between 1 and 5, we could impose these constraints as inequalities: \(Red\geq 1\) and \(Red\leq 5\), for all 25 variables. All in all, what I mean to show you is that all constraints can be expressed as a system of equations and inequalities. More specifically, these equations and inequalities are all linear.

This is where linear programming comes in to the picture. In linear programming, we attempt to maximize or minimize an objective (linear) function, usually representing a cost as a function of the decision variables, when we impose constraints as a system of linear inequalities like \(Ax\leq b\) or \(Ax\geq b\). In linear programming, \(A_{n\times m}\) is a matrix of coefficients whose m columns represent the m decision variables and the n rows represent the n constraints as inequalities. On the right-hand side, b is a vector whose n components represent the right-hand side of each constraint inequality.

To give an idea about what a linear programming problem looks like, I'll use the following example from Taha (1982): A paint factory that produces interior and exterior paints uses two types of raw materials (A and B). The maximum availability of A is 6 tons per day, whereas for B it is 8 tons per day. To produce 1 ton of exterior paint, they require 1 ton of raw material A and 2 tons of raw material B, whereas for 1 ton of interior paint, they require 2 tons of A and 1 ton of B. Market surveys indicate that the daily demand for interior paint cannot exceed that of exterior paint by more than 1 ton. The survey also indicates that the maximum demand for interior paint is limited to 2 tons daily. Furthermore, the wholesale price per ton of interior paint is $2000 and for exterior paint it is $3000. How much interior and exterior paint should the factory produce to maximize gross income?

In the example problem given, it is clear that our two decision variables are \(X_I\) and \(X_E\) representing the daily production (in tons) of interior and exterior paint respectively. Then we could express the usage restrictions on the usage of raw materials and maximum raw material availability as follows:

\[\begin{align*} X_E + 2X_I &\leq 6 \quad \text{(raw material A)}\\ 2X_E + X_I &\leq 8 \quad \text{(raw material B)} \end{align*}\]
The demand restrictions (according to the market surveys), could be expressed as

\[\begin{align*} -X_E + X_I &\leq 1 \quad & \text{(the demand for interior paint cannot exceed}\\ {} & {}\qquad & \text{that of exterior paint by more than 1)}\\ X_I &\leq 2 \quad & \text{(the daily demand for interior paint is less than 2)} \end{align*}\]
And then of course there are the standard non-negativity restrictions on the decision variables that are tacitly assumed in any linear program: \(X_E, X_I \geq 0\). We want to maximize the gross income, and this income is given by \(z=3000X_E+2000X_I\), which is a linear function on the decision variables. All in all, our linear program would be:

Maximize \(z(X)=Z\cdot X\), subject to \(AX\leq B\), where \(X=(X_E,X_I)\), \(Z=(3000,2000)^T\), \(B=(6,8,1,2)^T\), and \(A\) is the following matrix of coefficients:

\[A=\left(\begin{array}{cc} 1 & 2\\ 2 & 1\\ -1 & 1\\ 0 & 1\\ \end{array}\right)\]

With \(X_E, X_I\geq 0\). Alternatively, in non-matrix form, our problem would be to maximize \(z=3000X_E+2000X_i\) subject to the following restrictions:

\[\begin{align*} 1X_E + 2X_I &\leq 6\\ 2X_E + 1X_I &\leq 8\\ -1X_E + 1X_I &\leq 1\\ 0X_E + 1X_I &\leq 2 \end{align*}\]
Again, subject to the non-negativity restrictions on \(X_E\) and \(X_I\).


The lpSolve package

It is not my intention in this post to give a thorough explanation of how the simplex algorithm used in linear programming works, nor delve into the principle of duality in linear programs. My intention is a more pragmatic and modest one- to show how to use one of the R packages dedicated to linear programming to solve such problems. Of course, knowing the inner workings of the mathematical tools could be useful in some cases, but in most real-life scenarios, it suffices to be able to identify the tool needed, to set up the problem in terms of the inputs that the tool requires, and to be able to assess and interpret the output. Specifically, we will make use of the "lpSolve" R package, which is an R wrapper to the lp_solve version 5.5 functions freely available under the LGPL 2 software license2. So, to continue with the example problem we defined above, the R code for solving it would be as follows:

1
2
3
4
5
6
7
8
library(lpSolve)
z <- c(3000,2000)
B <- c(6,8,1,2)
A <- matrix(c(1,2,2,1,-1,1,0,1),ncol=2,byrow=TRUE)
colnames(A) <- c("X_E","X_I")
dir <- rep("<=",4)
(result <- lp("max", z, A, dir, B))
result$solution
Success: the objective function is 12666.67
[1] 3.333333 1.333333
In the code above, we obtained a maximum gross income of $12666.67 with 3.33 tons of exterior paint (\(X_E\)) and 1.33 tons of interior paint (\(X_I\)). In line 4 we define the constraints matrix A. In the next line, we give names to the columns that correspond to the decision variables. This is not necessary, but the user must be aware that the solution vector will be given in terms of these variables in the order that they appear on the constraints matrix. Notice that the lp function call in line 7 requires we specify whether we want a maximum or minimum for the objective function, in this case given by the vector z defined in line 2. A neat feature of the lp function is that for each constraint row of the constraints matrix, we can specify a direction for the inequality as "<=", "=" or ">=", so not all restriction inequalities have to be of the same type.

Granted, our zebra puzzle is quite more complicated than this example problem. It involves at least 25 decision variables (the 5 nationalities, 5 pets, 5 colors, 5 cigarette brands and 5 drinks). But having seen how we mathematically expressed some of the restriction clues, it is clear that they all involve inequalities or equalities, which suits us fine because the lp function in lpSolve can handle any of those three types of restrictions for each constraint separately.

An important point to consider though, is that we are only interested in integer solutions for the decision variables. Although classic linear programming is intended to solve for real solutions on the decision variables, there are variants of the linear programming method, called integer linear programming, which restrict the solutions to integer values. Fortunately, lp can handle these cases and even the more restricted cases where the decision variable is binary (either 0 or 1). As we shall see, it is very easy to indicate which variables we want to be real, integer, or binary. So on that aspect of our zebra problem requiring that our variables be restricted to integer values (1,2,3,4 or 5) indicating the house where each item is, we have nothing to worry about.

Not so fast, cowboy

It would seem like just by invoking the lpSolve library, our zebra problem has been solved. But it isn't. For one thing, how can we efficiently and most simply represent the 25 decision variables and build the constraints matrix? Do we have to build this matrix manually, perhaps using a spreadsheet program and reading it in as a csv file from R? Instead of doing that, I decided to use a neat feature of R which allows me to index elements of a vector by using character-string names instead of integer numbers. Such vectors are called named vectors. So what I will do is first define a character vector with all the variable names. For each row in the restriction matrix, I will start with a numeric vector of 25 elements all set to 0 and name that vector with the variable names I defined previously. If I want to set a particular element of that constraint row to a certain value, I just simply index it by the variable name instead of using a number. To keep things simple as I show you how to do this, we will pretend that we only have four decision variables: "W","X","Y" and "Z".

To begin with, how would we define the constraint "X is in house 3"? As we have previously seen, that is tantamount to \(X=3\), so our constraint matrix row would be defined in R as follows:

variable_names <- c("W","X","Y","Z")
row1 <- rep(0,4)
names(row1) <- variable_names
row1["X"] <- 1
row1
 W X Y Z 
 0 1 0 0

The named vector row1 contains the coefficients for the constraint. The direction of that constraint would be "=", and the corresponding element of the right-hand side vector of the (in this case) equality would be 3. Now say we want to indicate that "W and Z are in the same house". That would be equivalent to stating that \(W=Z\) or \(W-Z=0\). In that case we would add a second row to are constraint matrix as follows:

row2 <- rep(0,4)
names(row2) <- variable_names
row2["W"] <- 1
row2["Z"] <- -1
row2
  W  X  Y  Z 
  1  0  0 -1

As in the previous case of row1, the direction (sign) of that constraint would again be "=" and the corresponding right-hand side value would be 0, since we want \(W-X=0\). You may be thinking that perhaps this is a complicated way to do this, but consider that with 25 variables per linear constraint where most of the coefficients are zero, it makes a lot of sense to set the coefficients only of one or two variables we are dealing with. In effect, we are seeing that the constraints matrix will be a sparse matrix (mostly filled with zeros). Constraints like "W is to the right of Y" are mathematically expressed as \(W-Y=1\) and it wouldn't be too hard to see how that constraint row would be defined in R.

Now when we come to a constraint like "X and Y are next to each other", that means that they differ by 1: \(|X-Y|\leq 1\). This absolute-value inequality results in two simultaneous inequalities: \(X-Y \leq 1\) and \(Y-X\leq 1\). These two inequalities can be easily represented as constraint rows in R as we have seen. The catch is that \(-1\leq X-Y \leq 1\) does not exclude the possibility that \(X-Y=0\), that is, that X and Y be equal. However if "X and Y are next to each other, they cannot occupy the same house, so to the \(|X-Y|\leq 1\) constraint we need to add \(X-Y\neq 0\). The problem is, how do we add a strict inequality constraint to a linear program when our only "direction" operators are "<=", "=" and ">="?

We want X and Y to be different (not equal), and since X and Y are limited to integer values, this is equivalent to stating that \(|X-Y|\geq 1\), which in turn means that either \(X-Y\geq 1\) or \(Y-X\leq -1\). However, we cannot have an "either/or" disjunction condition in a linear program, since all the conditions must be met simultaneously (through the "and" conjunction). We could use de Morgan's laws and state that transform these statements as I show below, , but we would just be going around in circles.

\[\begin{align*} X-Y\geq 1\;&\vee\;X-Y\leq -1 \equiv\\ \neg (X-Y\lt 1\;&\wedge\;X-Y\gt -1) \equiv\\ X-Y&\neq 0 \end{align*}\]
Fortunately, there's a "trick" that I discovered on the lp_solve reference guide web-page3 that one can use to incorporate constraints like \(|X-Y|\geq 1\) into a linear program. To generalize as much as possible in what follows, we will assume that \(X=a_1X_1 + \ldots + a_mX_m\) is a linear expression on the decision variables. We want to be able to express \(|X|\geq min\) as a valid linear programming constraint. The trick is to introduce an additional binary decision variable \(B\) that can only assume two values: 0 or 1.

\[\begin{align*} X+M\cdot B &\geq min\qquad (1) \\ -X+M(1-B) &\geq min \end{align*}\]
In the above, \(M\) is simply a big, positive constant. How big does it have to be is something we will discover shortly. For now, bear with me and let's see what happens if \(B=0\). In that case, the inequalities in (1) become:

\[\begin{align*} X&\geq min \qquad (1a)\\ -X+M &\geq min \end{align*}\]
In this case (\(B=0\)), \(X\geq min\) and since we are assuming that M is a large enough constant, the second condition will always be fulfilled. If \(B=1\), then (1) is changed to the following:

\[\begin{align*} X+M&\geq min \qquad (1b)\\ -X&\geq min \end{align*}\]
In that case (\(B=1\)), the condition \(-X \geq min\) must hold, which can only mean that \(-X\) is positive (X was negative). The first condition in (1b) automatically holds because we have been assuming that M is large enough anyways. So how large is "large enough"? Well, from either (1a) or (1b), we can see that \(M\geq min+|X|\). While we could in theory use a really large value for M like \(1\cdot 10^{20}\), it is not advisable to do so because that would create numerical instabilities with our linear program solver.

If we can somehow put an upper bound to \(|X|\) in our problem, then we can find a feasible value for M. In our zebra puzzle, say we have two variables like "Norwegian" and "Blue" and we want to express that these have different values \(|Norwegian-Blue|\geq 1\). By looking at the expression within the absolute value brackets \(Norwegian-Blue\), we would have to see how large the difference can become. Since all decision variables in our problem are within the 1-5 range, we can see that this difference, in absolute value, cannot be larger than 4. So that means that any value of M larger than \(4+1=5\) will suffice. In my R solution, I'll be using a value of 20 for M. At any rate, a "x not equal to y constraint" implies using one additional binary decision variable and two additional constraint rows, as shown in (1), in the linear program. Now we are ready to tackle the zebra puzzle in R, so without further ado, here's the script.

And finally, the R solution

1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
#define the variables, each assuming integer values between 1 and 5
nationalities <- c("Dane","English","German","Norwegian","Sweede")
drinks <- c("Beer","Coffee","Milk","Tea","Water")
smokes <- c("Blend","Bluemaster","Dunhill","Pallmall","Prince")
pets <- c("Bird","Cat","Dog","Horse","Zebra")
house_color <- c("Blue","Green","Red","White","Yellow")
variable_names <- c(nationalities,drinks,smokes,pets,house_color)
#add binary variables for defining inequalities:
#4 for the problem constraints
#50 for requiring that the variables within each group
#(colors, nationalities, pets, etc.) have to be different
variable_names <- c(variable_names,sprintf("B%02d",1:54))
#define maximum bound for inequalities
M <- 20
#define Binary variable counter
b_counter <- 1
#define blank constraints, dir and rhs
constraints <- NULL
dir <- character(0)
rhs <- numeric(0)
#functions for defining constraints
x_is_y <- function(x,y) {
  #x and y are in the same house
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  new_row[y] <- -1
  constraints <<- rbind(constraints,new_row)
  row.names(constraints) <<- NULL
  dir <<- c(dir,"=")
  rhs <<- c(rhs,0)
}
x_is_in_n <- function(x,n) {
  #x is in house number n
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  constraints <<- rbind(constraints,new_row)
  row.names(constraints) <<- NULL
  dir <<- c(dir,"=")
  rhs <<- c(rhs,n)  
}
x_to_left_of_y <- function(x,y) {
  #x is on the house to the right of y
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  new_row[y] <- -1
  constraints <<- rbind(constraints,new_row)
  row.names(constraints) <<- NULL
  dir <<- c(dir,"=")
  rhs <<- c(rhs,-1)  
}
x_is_not_y <- function(x,y) {
  #function that defines x and y as not equal
  #1st inequality constraint
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  B_variable <- sprintf("B%02d",b_counter)
  new_row[x] <- 1
  new_row[y] <- -1
  new_row[B_variable] <- M
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,">=")
  rhs <<- c(rhs,1)
  #2nd inequality constraint
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- -1
  new_row[y] <- 1
  new_row[B_variable] <- -M
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,">=")
  rhs <<- c(rhs,1-M)
  row.names(constraints) <<- NULL
  b_counter <<- b_counter+1  
}
x_is_next_to_y <- function(x,y) {
  #x is next to y (x and y are neighbors)
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  new_row[y] <- -1
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,"<=")
  rhs <<- c(rhs,1)
  #2nd neighboring constraint row
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- -1
  new_row[y] <- 1
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,"<=")
  rhs <<- c(rhs,1)
  #now define inequality constraints
  x_is_not_y(x,y)
}
inequality_within_group <- function(var_group) {
  #function that defines all variables within a group
  #to be different from each other.
  #first obtain all possible pair of group variable combinations
  group_combinations <- t(combn(var_group,2))
  #now loop through this table, adding constraints
  for (i in 1:nrow(group_combinations)) {
    x_is_not_y (group_combinations[i,1],group_combinations[i,2])
  }
}
variable_bounds <- function(x) {
  #function that creates constraints for bounds of 
  #problem variable (must be in 1-5 range)
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,">=")
  rhs <<- c(rhs,1)
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,"<=")
  rhs <<- c(rhs,5)
  row.names(constraints) <<- NULL
}
#Now define constraints in Prolog fashion
x_is_y("English","Red")
x_is_y("Sweede","Dog")
x_is_y("Dane","Tea")
x_is_y("Coffee","Green")
x_is_y("Dunhill","Yellow")
x_is_y("Pallmall","Bird")
x_is_y("German","Prince")
x_is_y("Bluemaster","Beer")
x_to_left_of_y("White","Green")
x_is_next_to_y("Blend","Cat")
x_is_next_to_y("Dunhill","Horse")
x_is_next_to_y("Blend","Water")
x_is_next_to_y("Norwegian","Blue")
x_is_in_n("Milk",3)
x_is_in_n("Norwegian",1)
inequality_within_group(nationalities)
inequality_within_group(drinks)
inequality_within_group(smokes)
inequality_within_group(pets)
inequality_within_group(house_color)
for (v in variable_names[1:25]) variable_bounds(v)
#Finally invoke the lp solve to solve the linear program
#and output results
library(lpSolve)
obj <- rep(0,79)
result <- lp("max", obj, constraints, dir, rhs, int.vec=1:25,binary.vec=26:79)
result <- result$solution[1:25]
result_table <- NULL
for (i in 1:5) {
  result_table <- cbind(result_table,
    variable_names[which(result==i)])
}
colnames(result_table) <- paste("House",1:5)
row.names(result_table) <- c("Nationality","Drinks","Smokes","Pet","House color")
result_table
             House 1     House 2 House 3    House 4      House 5 
 Nationality "Norwegian" "Dane"  "English"  "Sweede"     "German"
 Drinks      "Water"     "Tea"   "Milk"     "Beer"       "Coffee"
 Smokes      "Dunhill"   "Blend" "Pallmall" "Bluemaster" "Prince"
 Pet         "Cat"       "Horse" "Bird"     "Dog"        "Zebra" 
 House color "Yellow"    "Blue"  "Red"      "White"      "Green"

So the zebra is in house number 5 (green colored), where the Coffee drinking, Prince smoking German guy lives.

Notes

  1. There is a thorough list of coding implementations (with source code in various languages) of the Zebra Puzzle at the rosettacode.org site (see WILLNESS et al).
  2. See BERKELAAR (2015).
  3. See BERKELAAR (2010).

Bibliography

  • BERKELAAR, M. et al. (2010). lp_solve 5.5 reference guide: Absolute values. http://lpsolve.sourceforge.net/5.5/asolute.htm.
  • BERKELAAR, M. et al. (2015). Interface to 'Lp_solve' v. 5.5 to Solve Linear/Integer Programs. R package "lpSolve". http://cran.r-project.org/.
  • GOH, J. (2007) The Zebra eye [Digital visualization]. Retrieved from http://fav.me/d11cbfy.
  • R DEVELOPMENT CORE TEAM (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org.
  • TAHA, H. (1982). Operations Research - An Introduction. 3rd Edition. New York: MacMillan Publishing Co.
  • WILLNESS et al. (2011). Zebra Puzzle. https://rosettacode.org/wiki/Zebra_puzzle.
    .
  • Zebra Puzzle. (2016, August 18). In Wikipedia, The Free Encyclopedia. Retrieved 17:01, October 1, 2016, from https://en.wikipedia.org/w/index.php?title=Zebra_Puzzle&oldid=735058353


If you found this post interesting or useful, please share it on Google+, Facebook or Twitter so that others may find it too.