
Wednesday, October 5, 2016

R Puzzle No. 2 - Where's the zebra?

For the second R puzzle, we will tackle the famous Zebra Puzzle, sometimes known as Einstein's Puzzle or Einstein's Riddle. There are various implementations of solutions to the puzzle in different programming languages1, but here we will be using R and an R package called lpSolve, which is intended to solve linear programming problems. I had a lot of fun preparing this entry and learned quite a bit on how to implement linear programming models using R - I hope you will find this post entertaining, interesting and useful.

The Puzzle


The Zebra Eye, by Singapore artist Jarrell Goh aka fuzzyzebra (2007).
The Zebra puzzle, or Einstein's riddle as it is often called, was popularized in a 1962 issue of Life International magazine, with a follow-up in a March 1963 issue of Life containing the solution. It is said to have been invented by a young Einstein, although it is also credited to Lewis Carroll. The truth is, there is no clear evidence that either of them authored the riddle. I should also point out that various versions of the puzzle exist, with different nationalities, different brands of cigarettes, or even sports teams in place of cigarette brands, and so on.

The version of the puzzle we will be using here is as follows. There are five houses along a street whose five occupants are all of different nationalities (English, German, Norwegian, Swedish, and Danish). Each occupant also smokes a different brand of cigarette (five brands in all), has a different favorite drink, has a different pet and owns a house of a different color than his neighbors. The following clues are given:


  • The Englishman lives in the red house.
  • The Swede has a dog.
  • The Dane drinks tea.
  • The man in the green house drinks coffee.
  • The white house is on the left of the green house.
  • The man in the yellow house smokes Dunhill.
  • The man who smokes Pall Mall has a bird.
  • The German smokes Prince.
  • The man who drinks beer smokes Bluemaster.
  • The man that smokes Blends lives next to the man with a cat.
  • The man with a horse lives next to the one that smokes Dunhill.
  • The man who smokes Blends lives next to the man that drinks water.
  • The Norwegian lives next to the blue house.
  • The man in the middle house (third house) drinks milk.
  • The Norwegian lives in the first house.
If you have never tried to solve this puzzle, you should. It'll give you some moments of good entertainment. I suggest you make a table with 5 columns and 5 rows in which the columns represent the five houses, numbered 1-5 from left to right and the rows represent the nationalities, cigarette brands, pets, favorite drinks and house colors. Start with the most direct clues and incrementally reach the solution by using one clue at a time. Once you reach the solution, reflect on how you did it. What algorithm did you use? How did you represent the data? And finally, how would you program a computer (using R specifically) to solve this puzzle? This is when the real fun begins...

A mathematical approach

Like I said, there are various code implementations in various programming languages that you can find on the Internet. On the rosettacode.org site (listed in the References section), there's even an R solution using an R package called "combinat", which I confess I'm not familiar with but I assume it's similar to the Python "constraints" module. In this post I'll take a different approach (using linear programming), but first, let's lay down the basic strategy for solving this puzzle computationally.

If you tried to solve the puzzle by hand using a 5x5 table like the one I suggested, you probably considered the blank table cells as the problem variables, to be filled by a specific drink, nationality or other characteristic. If we are to solve this puzzle by coding it, we should take another approach: we should instead consider the 25 characteristics (brands of cigarettes, nationalities, house colors and so on) as the variables, each variable assuming an integer value between 1 and 5 representing the number of the house where that specific item is located. Doing so brings our puzzle within the realm of mathematics.

Without any restrictions, we simply assign to each of the 25 variables a number between 1 and 5. There are \(5^{25}\approx 2.98\cdot 10^{17}\) possible ways of doing so. If we next assume that there are no two houses with occupants of the same nationality, pets, brands of cigarettes, etc., then we narrow the solution space down to \((5!)^5=24,883,200,000\) possible cases, which is still a considerable number of cases. Each of the puzzle's clues is really an additional constraint and together, these constraints reduce the possibilities to a single definitive solution.

So once we have defined our 25 variables, identifying them according to the specific item they represent, such as Horse, Norwegian, Tea, Red, and so on, how do we mathematically represent the constraints? Consider the first clue, for example: the Englishman lives in the red house. This can be represented mathematically as \(English=Red\). Clues like "the occupant of the middle house (third house) drinks milk" can be expressed as \(Milk=3\). Since the white house is on the left of the green house, we can write \(White=Green - 1\). If the Norwegian lives in the first house, we represent this by \(Norwegian=1\). The "x is next to y"-type clues are a bit trickier though.

What does "x is next to y" or "x and y are neighbors" mean? It simply means that their house numbers differ by one. So, for example, "the Norwegian lives next to the blue house" could be expressed mathematically as \(|Norwegian-Blue|=1\), which in turn means that either \(Norwegian-Blue=1\) or \(Blue-Norwegian=1\). Finally, since we require all variables to be bounded between 1 and 5, we could impose these constraints as inequalities: \(Red\geq 1\) and \(Red\leq 5\), for all 25 variables. All in all, what I mean to show you is that all constraints can be expressed as a system of equations and inequalities. More specifically, these equations and inequalities are all linear.
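Collected together, the clue translations discussed so far already form a small system of linear relations (this block simply restates the formulas above in one place):

\[\begin{align*} English - Red &= 0\\ Milk &= 3\\ White - Green &= -1\\ Norwegian &= 1\\ \left|Norwegian - Blue\right| &= 1\\ 1 \leq Red &\leq 5 \quad\text{(and likewise for all 25 variables)} \end{align*}\]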

This is where linear programming comes into the picture. In linear programming, we attempt to maximize or minimize a linear objective function, usually representing a cost as a function of the decision variables, while imposing constraints as a system of linear inequalities like \(Ax\leq b\) or \(Ax\geq b\). Here, \(A_{n\times m}\) is a matrix of coefficients whose m columns correspond to the m decision variables and whose n rows correspond to the n constraint inequalities. On the right-hand side, b is a vector whose n components are the right-hand sides of those constraints.

To give an idea about what a linear programming problem looks like, I'll use the following example from Taha (1982): A paint factory that produces interior and exterior paints uses two types of raw materials (A and B). The maximum availability of A is 6 tons per day, whereas for B it is 8 tons per day. To produce 1 ton of exterior paint, they require 1 ton of raw material A and 2 tons of raw material B, whereas for 1 ton of interior paint, they require 2 tons of A and 1 ton of B. Market surveys indicate that the daily demand for interior paint cannot exceed that of exterior paint by more than 1 ton. The survey also indicates that the maximum demand for interior paint is limited to 2 tons daily. Furthermore, the wholesale price per ton of interior paint is $2000 and for exterior paint it is $3000. How much interior and exterior paint should the factory produce to maximize gross income?

In the example problem given, it is clear that our two decision variables are \(X_I\) and \(X_E\), representing the daily production (in tons) of interior and exterior paint respectively. We could then express the restrictions on raw material usage and maximum raw material availability as follows:

\[\begin{align*} X_E + 2X_I &\leq 6 \quad \text{(raw material A)}\\ 2X_E + X_I &\leq 8 \quad \text{(raw material B)} \end{align*}\]
The demand restrictions (according to the market surveys), could be expressed as

\[\begin{align*} -X_E + X_I &\leq 1 \quad \text{(the demand for interior paint cannot exceed that of exterior paint by more than 1)}\\ X_I &\leq 2 \quad \text{(the daily demand for interior paint is at most 2)} \end{align*}\]
And then of course there are the standard non-negativity restrictions on the decision variables that are tacitly assumed in any linear program: \(X_E, X_I \geq 0\). We want to maximize the gross income, and this income is given by \(z=3000X_E+2000X_I\), which is a linear function on the decision variables. All in all, our linear program would be:

Maximize \(z(X)=Z\cdot X\), subject to \(AX\leq B\), where \(X=(X_E,X_I)\), \(Z=(3000,2000)^T\), \(B=(6,8,1,2)^T\), and \(A\) is the following matrix of coefficients:

\[A=\left(\begin{array}{cc} 1 & 2\\ 2 & 1\\ -1 & 1\\ 0 & 1\\ \end{array}\right)\]

With \(X_E, X_I\geq 0\). Alternatively, in non-matrix form, our problem would be to maximize \(z=3000X_E+2000X_I\) subject to the following restrictions:

\[\begin{align*} 1X_E + 2X_I &\leq 6\\ 2X_E + 1X_I &\leq 8\\ -1X_E + 1X_I &\leq 1\\ 0X_E + 1X_I &\leq 2 \end{align*}\]
Again, subject to the non-negativity restrictions on \(X_E\) and \(X_I\).


The lpSolve package

It is not my intention in this post to give a thorough explanation of how the simplex algorithm used in linear programming works, nor to delve into the principle of duality in linear programs. My intention is a more pragmatic and modest one: to show how to use one of the R packages dedicated to linear programming to solve such problems. Of course, knowing the inner workings of the mathematical tools can be useful in some cases, but in most real-life scenarios it suffices to be able to identify the tool needed, to set up the problem in terms of the inputs that the tool requires, and to be able to assess and interpret the output. Specifically, we will make use of the "lpSolve" R package, which is an R wrapper for the lp_solve version 5.5 functions, freely available under the LGPL 2 software license2. So, to continue with the example problem we defined above, the R code for solving it would be as follows:

library(lpSolve)
z <- c(3000,2000)                                  #objective coefficients (exterior, interior)
B <- c(6,8,1,2)                                    #right-hand sides of the constraints
A <- matrix(c(1,2,2,1,-1,1,0,1),ncol=2,byrow=TRUE) #constraint coefficients, one row per constraint
colnames(A) <- c("X_E","X_I")
dir <- rep("<=",4)                                 #all four constraints are of the "<=" type
(result <- lp("max", z, A, dir, B))
result$solution
Success: the objective function is 12666.67
[1] 3.333333 1.333333
In the code above, we obtained a maximum gross income of $12,666.67, with 3.33 tons of exterior paint (\(X_E\)) and 1.33 tons of interior paint (\(X_I\)). In line 4 we define the constraints matrix A. In the next line, we give names to the columns that correspond to the decision variables. This is not strictly necessary, but the user must be aware that the solution vector will be given in terms of these variables in the order in which they appear in the constraints matrix. Notice that the lp function call in line 7 requires that we specify whether we want to maximize or minimize the objective function, in this case given by the vector z defined in line 2. A neat feature of the lp function is that for each row of the constraints matrix we can specify a direction for the inequality as "<=", "=" or ">=", so not all restriction inequalities have to be of the same type.

Granted, our zebra puzzle is quite a bit more complicated than this example problem. It involves at least 25 decision variables (the 5 nationalities, 5 pets, 5 colors, 5 cigarette brands and 5 drinks). But having seen how we mathematically expressed some of the restriction clues, it is clear that they all involve equalities or inequalities, which suits us fine because the lp function in lpSolve can handle any of those three types of restrictions for each constraint separately.

An important point to consider, though, is that we are only interested in integer solutions for the decision variables. Although classic linear programming solves for real-valued decision variables, there is a variant of the method, called integer linear programming, which restricts the solutions to integer values. Fortunately, lp can handle these cases and even the more restricted case where a decision variable is binary (either 0 or 1). As we shall see, it is very easy to indicate which variables we want to be real, integer, or binary. So regarding the requirement that our variables be restricted to the integer values 1 through 5, indicating the house where each item is located, we have nothing to worry about.
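As a quick illustration of how this is declared (a toy problem of my own, not part of the zebra script), lp's int.vec and binary.vec arguments take the indices of the decision variables that are to be treated as integer or binary:

library(lpSolve)
#toy integer program: maximize 2x + 3y subject to x + 2y <= 8 and 3x + y <= 9
obj <- c(2, 3)
A <- matrix(c(1, 2,
              3, 1), ncol = 2, byrow = TRUE)
res <- lp("max", obj, A, rep("<=", 2), c(8, 9), int.vec = 1:2)  #both variables integer
res$solution  #integer-valued solution vector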

Not so fast, cowboy

It would seem that just by invoking the lpSolve library, our zebra problem has been solved. But it isn't. For one thing, how can we most simply and efficiently represent the 25 decision variables and build the constraints matrix? Do we have to build this matrix manually, perhaps using a spreadsheet program and reading it in as a csv file from R? Instead of doing that, I decided to use a neat feature of R that allows me to index elements of a vector by character-string names instead of integer numbers. Such vectors are called named vectors. So what I will do is first define a character vector with all the variable names. For each row in the constraints matrix, I will start with a numeric vector of 25 elements all set to 0 and name that vector with the variable names I defined previously. If I want to set a particular element of that constraint row to a certain value, I simply index it by the variable name instead of by a number. To keep things simple as I show you how to do this, we will pretend that we only have four decision variables: "W", "X", "Y" and "Z".

To begin with, how would we define the constraint "X is in house 3"? As we have previously seen, that is tantamount to \(X=3\), so our constraint matrix row would be defined in R as follows:

variable_names <- c("W","X","Y","Z")
row1 <- rep(0,4)
names(row1) <- variable_names
row1["X"] <- 1
row1
 W X Y Z 
 0 1 0 0

The named vector row1 contains the coefficients for the constraint. The direction of that constraint would be "=", and the corresponding element of the right-hand side vector would be 3. Now say we want to indicate that "W and Z are in the same house". That is equivalent to stating that \(W=Z\), or \(W-Z=0\). In that case we would add a second row to our constraints matrix as follows:

row2 <- rep(0,4)
names(row2) <- variable_names
row2["W"] <- 1
row2["Z"] <- -1
row2
  W  X  Y  Z 
  1  0  0 -1

As in the previous case of row1, the direction of that constraint would again be "=" and the corresponding right-hand side value would be 0, since we want \(W-Z=0\). You may be thinking that this is a complicated way of doing things, but consider that with 25 variables per linear constraint, where most of the coefficients are zero, it makes a lot of sense to set only the coefficients of the one or two variables we are dealing with. In effect, the constraints matrix will be a sparse matrix (mostly filled with zeros). Constraints like "W is to the right of Y" are mathematically expressed as \(W-Y=1\), and the corresponding constraint row is defined in R just as easily, as the sketch below shows.
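For completeness, here is what that "W is to the right of Y" row would look like under the same four-variable toy setup (this snippet is illustrative only and not part of the final script):

row3 <- rep(0,4)
names(row3) <- variable_names
row3["W"] <- 1
row3["Y"] <- -1
row3
# W  X  Y  Z 
# 1  0 -1  0
#direction "=", right-hand side 1, since W - Y = 1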

Now when we come to a constraint like "X and Y are next to each other", that means that their house numbers differ by exactly 1: \(|X-Y|=1\). Part of this condition, \(|X-Y|\leq 1\), results in two simultaneous inequalities, \(X-Y \leq 1\) and \(Y-X\leq 1\), which can easily be represented as constraint rows in R as we have seen. The catch is that \(-1\leq X-Y \leq 1\) does not exclude the possibility that \(X-Y=0\), that is, that X and Y be equal. However, if X and Y are next to each other, they cannot occupy the same house, so to the \(|X-Y|\leq 1\) constraint we need to add \(X-Y\neq 0\). The problem is, how do we add a strict inequality constraint to a linear program when our only "direction" operators are "<=", "=" and ">="?

We want X and Y to be different (not equal), and since X and Y are limited to integer values, this is equivalent to stating that \(|X-Y|\geq 1\), which in turn means that either \(X-Y\geq 1\) or \(X-Y\leq -1\). However, we cannot have an "either/or" disjunction in a linear program, since all the conditions must be met simultaneously (joined by "and"). We could use De Morgan's laws to transform these statements as shown below, but we would just be going around in circles.

\[\begin{align*} X-Y\geq 1\;&\vee\;X-Y\leq -1 \equiv\\ \neg (X-Y\lt 1\;&\wedge\;X-Y\gt -1) \equiv\\ X-Y&\neq 0 \end{align*}\]
Fortunately, there's a "trick" that I discovered on the lp_solve reference guide web-page3 that one can use to incorporate constraints like \(|X-Y|\geq 1\) into a linear program. To generalize as much as possible in what follows, we will assume that \(X=a_1X_1 + \ldots + a_mX_m\) is a linear expression on the decision variables. We want to be able to express \(|X|\geq min\) as a valid linear programming constraint. The trick is to introduce an additional binary decision variable \(B\) that can only assume two values: 0 or 1.

\[\begin{align*} X+M\cdot B &\geq min\qquad (1) \\ -X+M(1-B) &\geq min \end{align*}\]
In the above, \(M\) is simply a big, positive constant. How big it has to be is something we will discover shortly. For now, bear with me and let's see what happens if \(B=0\). In that case, the inequalities in (1) become:

\[\begin{align*} X&\geq min \qquad (1a)\\ -X+M &\geq min \end{align*}\]
In this case (\(B=0\)), \(X\geq min\) and since we are assuming that M is a large enough constant, the second condition will always be fulfilled. If \(B=1\), then (1) is changed to the following:

\[\begin{align*} X+M&\geq min \qquad (1b)\\ -X&\geq min \end{align*}\]
In that case (\(B=1\)), the condition \(-X \geq min\) must hold, which can only mean that \(-X\) is positive (X was negative). The first condition in (1b) automatically holds because we are assuming that M is large enough anyway. So how large is "large enough"? From either (1a) or (1b), we can see that we need \(M\geq min+|X|\). While we could in theory use a really large value for M like \(1\cdot 10^{20}\), it is not advisable to do so because that would create numerical instabilities in our linear program solver.

If we can somehow put an upper bound on \(|X|\) in our problem, then we can find a feasible value for M. In our zebra puzzle, say we have two variables like "Norwegian" and "Blue" and we want to express that these have different values: \(|Norwegian-Blue|\geq 1\). Looking at the expression within the absolute value bars, \(Norwegian-Blue\), we have to ask how large the difference can become. Since all decision variables in our problem are within the 1-5 range, this difference, in absolute value, cannot be larger than 4, so any value of M of at least \(4+1=5\) will suffice. In my R solution, I'll be using a value of 20 for M. At any rate, an "x is not equal to y" constraint implies adding one binary decision variable and two constraint rows, as shown in (1), to the linear program; a minimal sketch of such a pair of rows is shown below. With that settled, we are ready to tackle the zebra puzzle in R.
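Continuing the same toy example, a pair of constraint rows encoding \(|X-Y|\geq 1\) with one extra binary variable (here called "B1"; the actual script below uses B01-B54) would look roughly like this:

M <- 20
variable_names <- c("W","X","Y","Z","B1")
rowA <- rep(0,5); names(rowA) <- variable_names
rowA[c("X","Y","B1")] <- c(1,-1,M)    #X - Y + M*B1 >= 1, direction ">=", rhs 1
rowB <- rep(0,5); names(rowB) <- variable_names
rowB[c("X","Y","B1")] <- c(-1,1,-M)   #Y - X + M*(1-B1) >= 1, i.e. -X + Y - M*B1 >= 1 - M, rhs 1 - M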

And finally, the R solution

#define the variables, each assuming integer values between 1 and 5
nationalities <- c("Dane","English","German","Norwegian","Sweede")
drinks <- c("Beer","Coffee","Milk","Tea","Water")
smokes <- c("Blend","Bluemaster","Dunhill","Pallmall","Prince")
pets <- c("Bird","Cat","Dog","Horse","Zebra")
house_color <- c("Blue","Green","Red","White","Yellow")
variable_names <- c(nationalities,drinks,smokes,pets,house_color)
#add binary variables for defining inequalities:
#4 for the problem constraints
#50 for requiring that the variables within each group
#(colors, nationalities, pets, etc.) have to be different
variable_names <- c(variable_names,sprintf("B%02d",1:54))
#define maximum bound for inequalities
M <- 20
#define Binary variable counter
b_counter <- 1
#define blank constraints, dir and rhs
constraints <- NULL
dir <- character(0)
rhs <- numeric(0)
#functions for defining constraints
x_is_y <- function(x,y) {
  #x and y are in the same house
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  new_row[y] <- -1
  constraints <<- rbind(constraints,new_row)
  row.names(constraints) <<- NULL
  dir <<- c(dir,"=")
  rhs <<- c(rhs,0)
}
x_is_in_n <- function(x,n) {
  #x is in house number n
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  constraints <<- rbind(constraints,new_row)
  row.names(constraints) <<- NULL
  dir <<- c(dir,"=")
  rhs <<- c(rhs,n)  
}
x_to_left_of_y <- function(x,y) {
  #x is in the house immediately to the left of y (x = y - 1)
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  new_row[y] <- -1
  constraints <<- rbind(constraints,new_row)
  row.names(constraints) <<- NULL
  dir <<- c(dir,"=")
  rhs <<- c(rhs,-1)  
}
x_is_not_y <- function(x,y) {
  #function that defines x and y as not equal
  #1st inequality constraint
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  B_variable <- sprintf("B%02d",b_counter)
  new_row[x] <- 1
  new_row[y] <- -1
  new_row[B_variable] <- M
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,">=")
  rhs <<- c(rhs,1)
  #2nd inequality constraint
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- -1
  new_row[y] <- 1
  new_row[B_variable] <- -M
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,">=")
  rhs <<- c(rhs,1-M)
  row.names(constraints) <<- NULL
  b_counter <<- b_counter+1  
}
x_is_next_to_y <- function(x,y) {
  #x is next to y (x and y are neighbors)
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  new_row[y] <- -1
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,"<=")
  rhs <<- c(rhs,1)
  #2nd neighboring constraint row
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- -1
  new_row[y] <- 1
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,"<=")
  rhs <<- c(rhs,1)
  #now define inequality constraints
  x_is_not_y(x,y)
}
inequality_within_group <- function(var_group) {
  #function that defines all variables within a group
  #to be different from each other.
  #first obtain all possible pair of group variable combinations
  group_combinations <- t(combn(var_group,2))
  #now loop through this table, adding constraints
  for (i in 1:nrow(group_combinations)) {
    x_is_not_y (group_combinations[i,1],group_combinations[i,2])
  }
}
variable_bounds <- function(x) {
  #function that creates constraints for bounds of 
  #problem variable (must be in 1-5 range)
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,">=")
  rhs <<- c(rhs,1)
  new_row <- numeric(length(variable_names))
  names(new_row) <- variable_names
  new_row[x] <- 1
  constraints <<- rbind(constraints,new_row)
  dir <<- c(dir,"<=")
  rhs <<- c(rhs,5)
  row.names(constraints) <<- NULL
}
#Now define constraints in Prolog fashion
x_is_y("English","Red")
x_is_y("Sweede","Dog")
x_is_y("Dane","Tea")
x_is_y("Coffee","Green")
x_is_y("Dunhill","Yellow")
x_is_y("Pallmall","Bird")
x_is_y("German","Prince")
x_is_y("Bluemaster","Beer")
x_to_left_of_y("White","Green")
x_is_next_to_y("Blend","Cat")
x_is_next_to_y("Dunhill","Horse")
x_is_next_to_y("Blend","Water")
x_is_next_to_y("Norwegian","Blue")
x_is_in_n("Milk",3)
x_is_in_n("Norwegian",1)
inequality_within_group(nationalities)
inequality_within_group(drinks)
inequality_within_group(smokes)
inequality_within_group(pets)
inequality_within_group(house_color)
for (v in variable_names[1:25]) variable_bounds(v)
#Finally invoke the lp solve to solve the linear program
#and output results
library(lpSolve)
obj <- rep(0,79)
result <- lp("max", obj, constraints, dir, rhs, int.vec=1:25,binary.vec=26:79)
result <- result$solution[1:25]
result_table <- NULL
for (i in 1:5) {
  result_table <- cbind(result_table,
    variable_names[which(result==i)])
}
colnames(result_table) <- paste("House",1:5)
row.names(result_table) <- c("Nationality","Drinks","Smokes","Pet","House color")
result_table
             House 1     House 2 House 3    House 4      House 5 
 Nationality "Norwegian" "Dane"  "English"  "Sweede"     "German"
 Drinks      "Water"     "Tea"   "Milk"     "Beer"       "Coffee"
 Smokes      "Dunhill"   "Blend" "Pallmall" "Bluemaster" "Prince"
 Pet         "Cat"       "Horse" "Bird"     "Dog"        "Zebra" 
 House color "Yellow"    "Blue"  "Red"      "White"      "Green"

So the zebra is in house number 5 (the green one), where the coffee-drinking, Prince-smoking German lives.

Notes

  1. There is a thorough list of coding implementations (with source code in various languages) of the Zebra Puzzle at the rosettacode.org site (see WILLNESS et al).
  2. See BERKELAAR (2015).
  3. See BERKELAAR (2010).

Bibliography

  • BERKELAAR, M. et al. (2010). lp_solve 5.5 reference guide: Absolute values. http://lpsolve.sourceforge.net/5.5/absolute.htm.
  • BERKELAAR, M. et al. (2015). Interface to 'Lp_solve' v. 5.5 to Solve Linear/Integer Programs. R package "lpSolve". http://cran.r-project.org/.
  • GOH, J. (2007) The Zebra eye [Digital visualization]. Retrieved from http://fav.me/d11cbfy.
  • R DEVELOPMENT CORE TEAM (2009). R: A Language and Environment for Statistical Computing. R Foundation for Statistical Computing, Vienna, Austria. ISBN 3-900051-07-0. http://www.R-project.org.
  • TAHA, H. (1982). Operations Research - An Introduction. 3rd Edition. New York: MacMillan Publishing Co.
  • WILLNESS et al. (2011). Zebra Puzzle. https://rosettacode.org/wiki/Zebra_puzzle.
  • Zebra Puzzle. (2016, August 18). In Wikipedia, The Free Encyclopedia. Retrieved 17:01, October 1, 2016, from https://en.wikipedia.org/w/index.php?title=Zebra_Puzzle&oldid=735058353


If you found this post interesting or useful, please share it on Google+, Facebook or Twitter so that others may find it too.

Thursday, September 22, 2016

Population Growth Models using R/simecol, Part 1 : The world population

In this entry, as part of the series on dynamic systems modeling with R and simecol, we'll take a look at population growth models, our main focus being human population growth models and how they tie into other theoretical frameworks such as demographic transition theory and carrying capacity. These different models are then fitted to world population data spanning the period from 10,000 BC to 2015, using R and simecol. In that latter part we will finally introduce the FME R package.

Malthus model of population growth

In 1798 the Reverend Thomas Malthus published An Essay on the Principle of Population, sparking a debate about population size and its effects on the wealth of nations. His Essay is framed within the wider context of considerations about the perfectibility of the human condition and it was Malthus' response to the likes of Condorcet and others who argued, in the typical positivist style of the era, that human reason would eventually surmount all obstacles in the way of mankind's progress.

At the core of the Malthusian argument was the observation that populations grow exponentially but the means of physical sustenance can only increase in arithmetic progression at best, and since population growth is commensurate to the availability of means of sustenance, any momentary improvement in the living condition of the human masses that would lift the restrictions on human population growth would inevitably result in long term famine and aggravation of man's plight. In modern day system dynamics terminology, we would say that population growth has a negative feedback mechanism that prevents further population growth and this idea in itself is not without justification. I mean, the planet cannot sustain exponential human population growth indefinitely, can it?

Malthus was in fact aware of negative feedback mechanisms and cyclical behavior in population growth. To paraphrase Malthus: the increase in population meant a decrease in wages and, at the same time, an increase in food prices (less food for a larger population). During such times, the difficulties in rearing a family are so great that population growth comes to a standstill. But at the same time, a decrease in the price of labor meant that farmers could employ more laborers to work the land and increase food production, thus loosening the restraints on population growth. The same retrograde and progressive movements in human well-being occurred again and again - a cycle of human misery and suffering1. Being a clergyman, Malthus saw this situation as "divinely imposed to teach virtuous behavior"2.


At any rate, what is known today as the Malthus model for population growth, given in (1), only considers the exponential aspect of growth, as given by the rate parameter \(r\). No doubt the Malthus model is the simplest population growth model. It characterizes the growth of a population as dependent only upon the reproductive rate \(r\) and the population level at any given time (\(N\)). Note that the negative feedback controls we spoke of earlier are conspicuously missing in this model. In this model, there is only positive feedback: higher population levels beget even higher population levels, and so on to infinity.

\[\frac{dN}{dt}=rN\qquad\qquad(1)\]
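To fix ideas about how such a model is written down with simecol (the same pattern used for all the models fitted later in this post), here is a minimal sketch of the Malthus model; the growth rate below is an arbitrary placeholder, not a fitted value:

library(simecol)
#exponential (Malthus) growth model: dN/dt = r * N
malthus_model <- new("odeModel",
  main = function(time, init, parms, ...) {
    dN <- parms["r"] * init["pop"]
    list(dN)
  },
  init  = c(pop = 2431214),       #starting population, same value used for the models below
  parms = c(r = 1e-03),           #arbitrary growth rate, for illustration only
  times = seq(from = -10000, to = 2015, by = 1),
  solver = "lsoda")
head(out(sim(malthus_model)))     #simulate and inspect the trajectory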

Verhulst or Logistic Model of Population Growth

Considering the ideas on population growth that Malthus expounded in his Essay, Adolphe Quetelet judged that Malthus failed to provide a mathematical explanation as to how population levels, whose growth was by nature exponential, would top off at some point and not continue to rise to infinity3. Quetelet himself and his pupil, Pierre Verhulst, would assume this task in their effort to apply the methods of physics, which had seen a huge success with Newton's work, to the social sciences, as was characteristic of Positivist thought in the nineteenth century4.

Making an analogy with physics, Quetelet argued that in addition to the principle of geometric growth of the population, there was another principle at work according to which "the resistance or the sum of the obstacles encountered in the growth of a population is proportional to the square of the speed at which the population level is increasing"5. This premise resembles the mathematical description of an object falling through a dense fluid: the deeper the object descends into the fluid, the greater the density, and hence the resistance to its downward trajectory, until the object reaches a depth beyond which it cannot descend further6. By applying this analogy to demographic growth, we can infer that there is a maximum population level. As a result, a population finds the cause of its eventual equilibrium in its own growth7. In modern literature, this model of population growth is given by the following differential equation:

\[\frac{dN}{dt}=r_{max}N\left(1-\frac{N}{K}\right)\qquad\qquad(2)\]
Let us examine this equation in more detail to understand its behavior. For small population levels, when \(N\) is much smaller than \(K\), \(N\) exhibits exponential-like growth, as dictated by the \(r_{max}\) parameter, also known as the intrinsic growth rate. This is because the \(\left(1-N/K\right)\) term is still relatively close to 1. However, as \(N\) becomes larger, this last term starts to matter because it approaches zero. As \(N(t)\) approaches this point, the growth given by the \(dN/dt\) differential peters out and the population level tops off asymptotically at \(K\). This asymptotic level \(K\) is known as the saturation point or the carrying capacity of the population. We can integrate equation (2) to obtain \(N(t)\), whose graph is as follows:

[Figure: the logistic (sigmoid) growth curve N(t), rising from N₀ with an inflection point at K/2 and leveling off at the carrying capacity K.]
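For reference, carrying out that integration gives the familiar closed form (a standard result, stated here for completeness, with \(N_0\) the initial population):

\[N(t)=\frac{K}{1+\left(\frac{K-N_0}{N_0}\right)e^{-r_{max}t}}\]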
Functions whose curve has this characteristic S-shape, like the Verhulst model of population growth, are called sigmoid functions8. We encounter sigmoid functions in many contexts: as the cumulative distribution function of the normal and Student's t distributions, as the hyperbolic tangent function, and as the activation function of neurons in mathematical neural network models. Then of course there's also the logistic family of functions, which occur in statistics and machine learning contexts as well as in chemistry, physics, linguistics and sociology (in the latter two contexts the logistic function is used to model language or social innovation/adoption processes). Hence the alternate denomination of the Verhulst model of population growth as the logistic model of population growth. Another feature of this model is that it has an inflection point: after \(N\) reaches \(K/2\), the rate of population growth starts to decelerate until it eventually levels off to zero. At this halfway inflection point, the differential rate of growth is highest and the curve has its "steepest" tangent line. Notice the symmetry of the curve around this inflection point.

The beauty of the logistic model of population growth lies in its simplicity (only two parameters) and the interpretability of those parameters. The intrinsic growth rate (parameter \(r_{max}\)) is the rate of exponential growth when the population is small, and the carrying capacity parameter \(K\) is simply the maximum population level attainable. One important disadvantage of the logistic model is the fact that its inflection point is exactly half the carrying capacity \(K\), which severely limits the applicability of this model. Nonetheless, Verhulst himself used this model to fit census data in France (1817-1831), Belgium (1815-1833), Essex County (1811-1831) and Russia (1796-1827), all with relative success9.

Variants of the logistic population growth model

In seeking to improve the applicability of the basic logistic model for population growth, many authors have since proposed models with more parameters that still retain the basic sigmoid features of the logistic model and include one inflection point. Given the inflexibility of the basic logistic growth model regarding the inflection point, Blumberg introduced what he called the hyperlogistic function, whose derivative is given by (3) and which we will in turn call the Blumberg model of population growth:

\[\frac{dN}{dt}=r_{max}N^\alpha\left(1-\frac{N}{K}\right)^\gamma\qquad\qquad(3)\]
The Blumberg model introduced two additional parameters to surmount problems raised by treating the intrinsic growth rate as a time-dependent polynomial10, while at the same time contributing some measure of pliancy to the inflection point, given by (4). I would add that equation (3) cannot always be integrated explicitly - Blumberg cataloged the various conditions on \(\alpha\) and \(\gamma\) under which (3) admits analytic solutions.

\[N_{inf}=\frac{\alpha}{\alpha+\gamma}K\qquad\qquad(4)\]
The Richards growth model, originally developed to fit empirical plant mass data, is given by:

\[\frac{dN}{dt}=r_{max}N\left(1-\left(\frac{N}{K}\right)^\beta\right)\qquad\qquad(5)\]
\[N_{inf}=\left(\frac{1}{1+\beta}\right)^{\tfrac{1}{\beta}}K\qquad\qquad(6)\]
The inflection point in the Richards model is given by (6). Note that for \(\beta=1\), the Richards model is the original logistic growth model with the same inflection point. For extreme cases of \(\beta\), we have \(N_{inf}=e^{-1}K\) when \(\beta\rightarrow 0\) and \(N_{inf}=K\) when \(\beta\rightarrow \infty\).

A further (and logical) generalization of the Blumberg and Richards models leads us to the so-called generalized logistic growth function with five parameters11 (equation 7), where \(\alpha\), \(\beta\) and \(\gamma\) are all real positive numbers. The inflection point for this model is given by (8), which makes sense as long as \(N_0 < N_{inf}\). If \(N_0 > N_{inf}\), then \(N_{inf}\) will never be attained, because the intrinsic growth rate is assumed to be positive.

\[\frac{dN}{dt}=r_{max}N^\alpha\left(1-\left(\frac{N}{K}\right)^\beta\right)^\gamma\qquad\qquad(7)\]
\[N_{inf}=\left(1+\frac{\beta\gamma}{\alpha}\right)^{-\tfrac{1}{\beta}}K\qquad\qquad(8)\]
Yet another growth model similar to the logistic growth model is the Gompertz growth model12. The Gompertz model has been used to describe exponential decay in mortality rates, the return on financial investments13 and tumor growth. Its equation takes the following form:

\[\frac{dN}{dt}=r_{max}N\,\log\left(\frac{K}{N}\right)\qquad\qquad(9)\]
An interesting variant of this model's representation is to use two stock variables, one representing the mass or population that is growing (\(N\)) and the other representing the diminishing growth rate (\(r\)). We would then have two differential equations14:

\[\begin{align*} \frac{dr}{dt}&=-kr\qquad\qquad(10)\\ \frac{dN}{dt}&=rN \end{align*}\]
It must be noted that all the models above imply exponential growth from the very first time instant, where in fact the effective exponential rate of growth is highest, since the population or mass levels are still far from the carrying capacity. In some situations, it would seem that some populations begin in a "dormant" stage in which they do not exhibit this sort of exponential growth. Such may be the case, for example, in societies before the industrial revolution or even before the agricultural revolution, when scarcity was the order of the day and high birth rates were offset by high mortality rates, resulting in population numbers that grew ever so slowly.

While going over the literature in writing this post, I found an interesting two-stage population growth model in Petzoldt (2008). He describes the situation in which a microorganism culture is dormant while it adapts to its environment, after which it begins to reproduce and increase in numbers according to the logistic growth mechanism outlined above. In other words, in this model, growth is delayed or lagged. Petzoldt's two-stage model is in fact a two-compartment model (see my entry on epidemiological models, where I discuss compartment models), with one compartment representing the part of the population in the "dormant" stage and the other representing the "active" population:

\[\begin{align*} \frac{dN_d}{dt}&=-r_1N_d\qquad\qquad(11)\\ \frac{dN_a}{dt}&=r_1N_d+r_2N_a\left(1-\frac{N_a}{K}\right) \end{align*}\]

Hyperbolic models of population growth

Some authors have pointed out that logistic population growth models, although fairly accurate for short-term growth of populations on a scale of centuries, fail to accurately describe the long-term population growth of the entire human race on a scale of thousands of years. Namely, the logistic models do not account for the fact that the world's population has grown from 2 billion to 7 billion in less than a century, in comparison to which the population growth curve seems almost flat for time periods in the remote past. In a seminal work, Von Foerster, Mora and Amiot15 proposed that, according to the data available by 1960, the world's population growth could be described by the following model:

\[\frac{dN}{dt}=\frac{C}{(T_c-t)^2}\qquad\qquad(12)\]
This model has one essential flaw: as \(t\) approaches \(T_c\), the denominator approaches zero and the population shoots up towards infinity. Even when modeling the world's population, this model predicts an ever-increasing growth rate when in fact, in 1962, the world's relative population growth rate \(\frac{1}{N}\frac{dN}{dt}\) peaked at 2.1% and has been decreasing ever since16. To avoid the singularity predicted by Von Foerster et al's model, the Russian physicist S.P. Kapitsa suggested the following approach:

\[\frac{dN}{dt}=\frac{C}{(T_c-t)^2 + \tau^2}\qquad\qquad(13)\]
The differential equation in (13) can be easily integrated, resulting in:

\[N(t)=\frac{C}{\tau}arccot\left(\frac{T_c-t}{\tau}\right)\qquad\qquad(14)\]
With (14), Kapitsa fitted his model to the world population data available at that time and found that \(C=(186\pm 1)\cdot 10^9\), \(T_c=2007\pm 1\), and \(\tau=42\pm1\). The parameter \(\tau\) represents, according to Kapitsa, the average lifespan of a human generation.
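As a quick sketch (my own illustration, not from Kapitsa's work), equation (14) with those fitted parameters can be evaluated directly in R, taking \(\mathrm{arccot}(x)=\pi/2-\arctan(x)\) so that the function stays within \((0,\pi)\):

#closed-form Kapitsa model, equation (14), with his published parameter estimates
kapitsa_N <- function(t, C = 186e9, Tc = 2007, tau = 42) {
  (C / tau) * (pi / 2 - atan((Tc - t) / tau))  #(C/tau) * arccot((Tc - t)/tau)
}
#plot the model over the last two millennia; the curve rises steeply as t approaches Tc
curve(kapitsa_N(x), from = 0, to = 2015, xlab = "year", ylab = "world population")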

Models with variable carrying capacity

The logistic models covered so far implicitly assume that the carrying capacity is a constant, presumably dependent on the environment's capacity to sustain a species population. While this might be plausible with different populations of plants and animal species, the human species sets itself apart from the others in one important aspect: its unprecedented and unmatched capacity to significantly alter the environment (and indeed the planet) to suit its purposes.

As a first approach, we could consider that the carrying capacity of the earth is gradually expanded by a growing human population, an idea grounded in the observation that a greater number of people implies greater food productivity and, eventually, a greater number of inventions to amplify the productivity of each individual. We would then have the following model for population growth:

\[\begin{align*} \frac{dN}{dt}&=rN\left(1-\frac{N}{K}\right)\qquad\qquad(15)\\ \frac{dK}{dt}&=\gamma K N \end{align*}\]
In the first equation of (15) we can easily spot our old friend, the Verhulst logistic model. This time, however, the second equation of (15) implies a growing carrying capacity always one step ahead of the population, and it doesn't take much to convince ourselves that this model implies a runaway, unbounded population. In fact, it can be shown that prior to the critical time point \(t_c\), this model behaves very much like the hyperbolic model described by equation (12)17. Still, we intuitively know that there must be some physical limits to the earth's population and that human technological innovation cannot push the carrying capacity indefinitely towards infinity. For example, we have already pointed out that since 1962 the relative population growth rate \(1/N\cdot dN/dt\) has been gradually decreasing. This occurred at the same time as the so-called Green Revolution, which brought about significant increases in food productivity and crop yields18. However, the Green Revolution did not bring about a surge in the relative population growth rate; quite the opposite occurred - the world population's growth seems to be slowing down.

At any rate, there is no reason to suppose that the carrying capacity remains constant throughout human history, and surely there must be both positive and negative influences on it. On the one hand, the idea that each person contributes to the growth of the human carrying capacity is not to be discarded entirely. On the other hand, the contribution of each additional person depends on the resources available, and these resources must be shared among more people as the population increases. What is being described here can be summarized in the following population-growth model proposed by Cohen19, which he called the Condorcet-Mill model after Condorcet and the British philosopher John Stuart Mill (1806-1873), who is cited as foreseeing that a stationary population is both inevitable and desirable.

\[\begin{align*} \frac{dN}{dt}&=rN\left(1-\frac{N}{K}\right)\qquad\qquad(16)\\ \frac{dK}{dt}&=\frac{L}{N}\frac{dN}{dt} \end{align*}\]

Introducing the FME package and some comments on the world population data we'll be using

We have finally come to the part of this blog post in which we will try to fit the world population data to the various population growth models covered. The data was obtained from the ourworldindata.org site20 and consists of a tabular csv file with two columns, which I have renamed "time" (the year) and "pop" (population). This data covers a wide time-span, from 10,000 BC until 2015, and to be perfectly honest, I cannot vouch for the validity of the population counts in the remote past, nor do I have any idea how they obtained census data for those periods. I'll just assume that the population data for the prehistoric and ancient time periods is fairly accurate. At any rate, this large time-scale population data allows us to compare the family of models based on logistic population growth to Kapitsa's singular approach.

For the model fitting process, we will use an R-package called FME (FME stands for Flexible Modelling Environment). FME introduces added functionalities over the simecol package to aid in the process of system dynamics model building:

  • FME's modFit incorporates additional non-linear optimization methods for parameter fitting over simecol's fitOdeModel: "Marq" for the Levenberg-Marquardt algorithm, which is a least squares method, "Pseudo" for the pseudo-random search algorithm, "Newton" for a Newton-type algorithm, and "bobyqa" for derivative-free optimization using quadratic approximations.
  • Global sensitivity analysis, which consists of assessing the change of certain model output variables according to changes in parameters over a large area of the parameter space, is done in FME via the sensRange function.
  • The sensFun function in the FME package provides a way to define sensitivity functions in order to perform a local sensitivity analysis, which consists of studying the effects of one parameter as it varies over a very small region near its nominal value.
  • The sensitivity functions are also used by FME to estimate the collinearity index of all possible subsets of the parameters. This has to do with the identifiability of the parameters in the model, a concept well worth delving into.
The nonlinear optimization methods used to fit ODE models like the ones I have been dealing with in this series of posts are very different from methods like the ordinary least squares used in linear regression, for example. The ordinary least squares method, on account of being linear in the model parameters, will always come up with a unique estimate of these parameters. The least squares estimate in linear regression is unique because there is only one global minimum in the parameter search space. In the case of nonlinear optimization problems like the ones we deal with in ODE models, however, the parameter search space is much more complex and there may be many local minima. This situation substantially complicates ODE model fitting, particularly when the model has several parameters.

When one is fitting ODE models, different initial "guesses" for the parameter values can result in entirely different fitted parameters and, indeed, in entirely different behavior of the model's variables. Determining just how much small changes in the initial parameter values affect the model's variables is the reason why we perform sensitivity analysis on the models. It may occur that some parameters are linearly or almost linearly dependent on others, and just as in the linear regression case, this multicollinearity negatively affects the identifiability of the model's parameters. In particular, the collinearity index given by the FME collin function provides a measure of the parameter set's identifiability based on the data we use to fit the model.

The collin function provides the collinearity index for all possible subsets of a model's parameters. The larger the collinearity index for any given set of parameters, the less identifiable those parameters are. As a rule of thumb, a parameter set with a collinearity index less than 20 is identifiable21.

Like sands through the hourglass, so are the days of our lives

We now begin the process of fitting various population models to the world population data described above. We will begin with the classic (Verhulst) Logistic growth model with 2 parameters:

library("simecol")
library("FME")
#Read census data
census <- read.table("world_pop.csv",header=TRUE,sep=",")
#Verhulst logistic growth model
verhulst_model <- new("odeModel",
        main=function(time,init,parms, ...) {
            x <- init
            p <- parms
            dpop <- x["pop"]*p["r"]*(1-(x["pop"]/p["K"]))
            list(dpop)},
        init=c(pop=2431214),
        parms=c(r=1e-02,K=1e10),
        times=seq(from=-10000,to=2015,by=1),
        solver="lsoda")

#cost function
fCost <- function(p) {
    parms(verhulst_model) <- p
    out <- out(sim(verhulst_model))
    return(modCost(out, census, weight="std"))
}

#model fitting
result <- modFit(fCost, parms(verhulst_model),
                 lower=c(r=0,K=2431214),upper=c(r=1,K=1e+12))
(parms(verhulst_model) <- result$par)
cat("SC : ",ssqOdeModel(NULL,verhulst_model,census$time,census[2]),"\n")
#identifiability analysis
sF <- sensFun(func = fCost, parms = parms(verhulst_model))
(colin <- collin(sF))
The model's output - the parameter estimates, the sum-of-squares goodness-of-fit measure calculated in line 28 of the above code, and the collinearity index of the parameter set (produced by lines 30-31) - is given below, along with the plot of the real population curve (in red) and the fitted model (in cyan). I have not included the source code for generating these plots, but could do so if any reader is interested.
           r            K 
6.065198e-04 9.999754e+11 
SC :  92.33517
  r K N collinearity
1 1 1 2          7.9


Let's interpret the results above. First of all, with a collinearity index of 7.9, the model's parameters are identifiable, so that's not an issue. An inspection of the r and K parameters reveals that the Verhulst logistic growth model predicts a maximum world population of almost \(1\cdot 10^{12}\), or one trillion inhabitants. Considering we are currently at just over 7.3 billion inhabitants, we would have a long way to go before we reach that saturation point. One look at the graph reveals that this model is simply not adequate. The model's curve bulges too far above the real population curve, indicating that the model predicts too much population growth at remote times in the past, when we know that real growth was barely noticeable. All the while, the model predicts a population of 3-4 billion in 2015, when we know the real population level to be over 7.3 billion this year. So we have two major flaws with this model: 1) it predicts that the modern, explosive population growth began much sooner than it really did, and 2) in modern times (1800-present), the model's predicted growth is simply not fast enough compared to the real data. These flaws are reflected in the model's high sum-of-squares value of 92.34.

One of the limitations of the classic logistic growth model is the inability to "tweak" the inflection point. In this case, with the model's K value estimated at 1 trillion, the inflection point would be half of that, or 500 billion - too far into the future. We know for a fact that the world population growth rate started to decline during the 1960's, so what we need is a logistic-type growth model in which the inflection point can be closer to the maximum population level. We will next attempt to fit the generalized logistic growth model, which would, in theory, afford us the flexibility of tweaking the inflection point.

The problem with the generalized logistic growth model is the large number of parameters (5) with just one differential equation. This results in a very complex parameter space with potential problems in parameter identifiability. To explore this situation, I decided to sample one hundred initial parameter guess values (N=100) using the uniform probability distribution with a given range for each parameter and storing these initial values in a data-frame (see code snippet below), from where I proceeded to use them as initial values for the model fitting procedure.

N <- 100   #number of initial parameter guesses (as stated above)
parameters <- data.frame(   r=runif(N,min=0,max=0.1),
                          K=runif(N,min=7349472099,max=1e+11),
                          alpha=runif(N,min=0.1,max=1.2),
                          beta=runif(N,min=0.1,max=2),
                          gamma=sample((1:10),size=N,replace=TRUE) )
#save the initial guesses so that the fitting script below can read them back in
write.table(parameters,"initial_parms.csv",sep="\t",row.names=FALSE)
I must comment on the reason why I sampled only integer values for the gamma (\(\gamma\)) parameter. Taking a look at model equation (7), we see that if the parenthesized expression on the right-hand side that is raised to the gamma exponent ever becomes negative, numerical errors arise, because we would be raising a negative base to a fractional exponent (see the one-liner below). While there are possibly ways around this issue, it was simpler for me to just use integer values for gamma. When fitting the generalized logistic model using these 100 different initial parameter sets, I obtained different parameter estimates with different sum-of-squares goodness-of-fit values, which I wrote to a csv file with the following code:
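As a one-line illustration of the numerical issue just mentioned (not part of the original fitting code), raising a negative base to a fractional exponent yields NaN in R, whereas an integer exponent does not:

c((-0.5)^1.5, (-0.5)^2)
# [1]  NaN 0.25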

library("simecol")
library("FME")
#Read census data
census <- read.table("world_pop.csv",header=TRUE,sep=",")
#Generalized logistic growth model
generalized_model <- new("odeModel",
        main=function(time,init,parms, ...) {
            x <- init
            p <- parms
            dpop <- x^p["alpha"]*p["r"]*(1-(x/p["K"])^p["beta"])^p["gamma"]
            list(dpop)},
        init=c(pop=2431214),
        parms=c(r=1e-08,K=7349472099,alpha=1,beta=1,gamma=4),
        times=seq(from=-10000.0,to=2015.0,by=1),
        solver="lsoda")

#cost function
fCost <- function(p) {
    parms(generalized_model) <- p
    out <- out(sim(generalized_model))
    return(modCost(out, census, weight="std"))
}
parameters <- read.table("initial_parms.csv",sep="\t",header=TRUE)
parameters <- as.matrix(parameters)
N <- nrow(parameters)   #number of initial parameter guesses (100)
results <- NULL         #accumulates the fitted parameters and goodness of fit for each run
for (i in 1:N) {
  #get initial parameter estimations and fit the model
  parms(generalized_model) <- parameters[i,]
  options(show.error.messages = FALSE)
  try(
    fitted_model <- modFit(fCost, parms(generalized_model) ,
      lower=c(r=0,K=7349472099,alpha=0.1,beta=0.1,gamma=1),
      upper=c(r=0.1,K=1e+11,alpha=1.2,beta=2,gamma=20) )
  )
  options(show.error.messages = TRUE)
  parms(generalized_model) <- fitted_model$par
  r <- as.numeric(fitted_model$par["r"])
  K <- as.numeric(fitted_model$par["K"])
  a <- as.numeric(fitted_model$par["alpha"])
  b <- as.numeric(fitted_model$par["beta"])
  g <- as.numeric(fitted_model$par["gamma"])
  Ninf <- as.numeric((1+b*g/a)^(-1/b)*K)
  output <- out(sim(generalized_model))
  SC <- ssqOdeModel(NULL,generalized_model,census$time,census[2])
  results <- rbind(results,
                      data.frame(   r=r,K=K,alpha=a,beta=b,gamma=g,
                                  Ninf=Ninf,SC=SC) )
  cat(i," ",SC,"\n")
} 
Choosing the parameter set with the smallest sum-of-squares (SC) value, I ran the simulation of the model and obtained the following graph and collinearity indexes:

           r           K   alpha     beta    gamma       N_inf       SC
1.836653e-05 99966063867 1.19538 1.996857 1.009536 60943709007 76.60192

   r K alpha beta gamma N collinearity
1  1 1     0    0     0 2          5.8
2  1 0     1    0     0 2        214.0
3  1 0     0    1     0 2          6.2
4  1 0     0    0     1 2          5.8
5  0 1     1    0     0 2          6.0
6  0 1     0    1     0 2         81.3
7  0 1     0    0     1 2       6442.9
8  0 0     1    1     0 2          6.3
9  0 0     1    0     1 2          5.9
10 0 0     0    1     1 2         81.0
11 1 1     1    0     0 3        403.1
12 1 1     0    1     0 3        131.9
13 1 1     0    0     1 3       9122.3
14 1 0     1    1     0 3        425.7
15 1 0     1    0     1 3        403.4
16 1 0     0    1     1 3        133.0
17 0 1     1    1     0 3        133.7
18 0 1     1    0     1 3       9018.9
19 0 1     0    1     1 3       6751.7
20 0 0     1    1     1 3        134.9
21 1 1     1    1     0 4        608.5
22 1 1     1    0     1 4      14594.6
23 1 1     0    1     1 4      11876.4
24 1 0     1    1     1 4        604.7
25 0 1     1    1     1 4      11704.0
26 1 1     1    1     1 5      14777.0


The fit of the generalized logistic model is slightly better, as can be seen from its sum-of-squares value of 76.6 compared to the Verhulst SC of 92.34, and the model's curve on the graph bends down slightly closer to the data values. However, the asymptotic population level is still unrealistically high at roughly 100 billion inhabitants (K is close to its upper bound of 1e+11), and so is the population at the inflection point (about 61 billion souls). As with the previous model, the population level at 2015 is underestimated and the modern-era type of growth begins much sooner, according to the model, than it really did. Notice also that the collinearity indices for several combinations of more than two parameters are very large, indicating that this model is nowhere near adequate on account of poor parameter identifiability.
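For reference, the inflection-point population reported in the table comes from the expression computed inside the loop, Ninf <- (1+b*g/a)^(-1/b)*K; plugging in the fitted values gives

\[
N_{inf} = K\left(1+\frac{\beta\gamma}{\alpha}\right)^{-1/\beta} \approx 9.9966\times 10^{10}\left(1+\frac{1.9969\times 1.0095}{1.1954}\right)^{-1/1.9969} \approx 6.09\times 10^{10},
\]

that is, roughly 61 billion people at the inflection point.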

I then attempted to fit Petzoldt's two-compartment model to see if it would be capable of retarding the modern-era type hyperexponential growth while at the same time not underestimating population levels during the third millennium. The results were the following (code and results below):

petzoldt_model <- new("odeModel",
  main=function(time,init,parms, ...) {
    with(as.list(c(init,parms)), {
      dy1 <- -a * y1                          # "inactive" part of population
      dy2 <-  a * y1 + r * y2 * (1 - y2 / K)  # growing population
      list(c(dy1, dy2), pop = y1 + y2)        # total pop = y1 + y2
    })
  },
  init=c(y1=2431214, y2=0),
  parms=c(r=1e-04,K=7349472099,a = 1e-05),
  times=seq(from=-10000.0,to=2015.0,by=1),
  solver="lsoda")

## cost function
fCost <- function(p) {
  parms(petzoldt_model) <- p
  out <- out(sim(petzoldt_model))
  return(modCost(out, census, weight="std"))
}

## fit model (all parameters)
result <- modFit(fCost, parms(petzoldt_model),
                 lower=c(r=0,K=7349472099,a=0),
                 upper=c(r=1,K=1e+12,a=1),
                 method="Port",
                 control=list(scale=c(r=1,K=1e-12,a=1)) )
                
(parms(petzoldt_model) <- result$par)
cat("SC : ",result$ssr,"\n")
#identifiability analysis
sF <- sensFun(func = fCost, parms = parms(petzoldt_model))
(colin <- collin(sF))
           r            K            a 
6.459584e-04 7.349472e+09 1.685916e-03 

SC :  115.537 

  r K a N collinearity
1 1 1 0 2          6.0
2 1 0 1 2          6.7
3 0 1 1 2          3.6
4 1 1 1 3         12.2
I would like to comment briefly on the code for the two-compartment model. The observed variable is pop; y1 and y2 are not observed variables. They are compartment variables used for modeling purposes but have no existence as such (there aren't two kinds of human beings depending on which compartment they belong to, unlike the epidemic compartment models, where infected individuals are distinct from recovered or susceptible individuals, for example). However, their sum y1+y2 is the total population, which is the variable we're really interested in fitting, so we need a way to indicate that in the code. What should be done is to pass this total in the results list along with the derivatives, as is done in the line list(c(dy1, dy2), pop = y1 + y2) in the above code. Note that the pop identifier in that list needs to match the name of the pop variable in the census data frame (the observed values).
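A quick way to verify that the names line up (assuming, as in the earlier models, that the census data frame has columns time and pop):

head(out(sim(petzoldt_model)))   #output columns should include time, y1, y2 and pop
names(census)                    #should include "pop", the observed variable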

Regarding the results, the delayed population growth caused by having inactive and active compartments, which is the hallmark of this model, did not do much to correct the flaws we saw in the previous two models. The sum-of-squares goodness-of-fit measure did not improve, and the growth is still too fast in the remote past while being too slow for the modern era. We next attempt to fit the Condorcet-Mill model (code and results below).

#Condorcet-Mill population growth model
cm_model <- new("odeModel",
        main=function(time,init,parms, ...) {
            x <- init
            p <- parms
            dpop <- p["r"]*x["pop"]*(1-x["pop"]/x["K"])
            dK <- p["L"]/x["pop"]*dpop
            list(c(dpop,dK))},
        init=c(pop=2431214,K=2432000),
        parms=c(r=1.4e-03,L=3.7e+09,K0=2432000),
        times=seq(from=-10000.0,to=2015.0,by=1.0),
        solver="lsoda")

#cost function
fCost <- function(p) {
    parms(cm_model) <- p
    init(cm_model) <- c(pop=2431214,K=as.numeric(p["K0"]))
    out <- out(sim(cm_model))
    return(modCost(out, census, weight="none"))
}

#model fitting
result <- modFit(fCost, parms(cm_model),
                 lower=c(r=0,L=0,K0=2431214.1),
                 upper=c(r=1,L=Inf,K0=Inf))
(parms(cm_model) <- result$par)
cat("SC : ",ssqOdeModel(NULL,cm_model,census$time,census[2]),"\n")
#identifiability analysis
sF <- sensFun(func = fCost, parms = parms(cm_model))
(colin <- collin(sF))
           r            L           K0 
6.062496e-04 3.553315e+16 2.431214e+06 

SC :  92.28798 

  r L K0 N collinearity
1 1 1  0 2          1.4
2 1 0  1 2          9.0
3 0 1  1 2          1.5
4 1 1  1 3         11.0


The Condorcet-Mill population growth model has two parameters according to the equations in (16). However, when you try to implement the model in simecol, the first question that comes up is: if K is a variable, what is its initial value? The carrying capacity is not an observable variable (it's just a theoretical construct), so we really have no way of knowing what its initial value is. Consequently, we should include the initial value of K as a parameter to be estimated from the data at hand. That is why a third parameter, K0, is included even though it does not appear in the ODE model definition. However, if you inspect the model's cost function, you'll see that before it runs a simulation with the given parameters it sets the initial value of K to K0 (this is what the line init(cm_model) <- c(pop=2431214, K=as.numeric(p["K0"])) does).

Regarding the results of this model's fit, they are very similar to those of the other logistic growth variants. We now turn our attention to the Kapitsa model, which is set apart from the models discussed so far in that it was originally meant to describe human population growth over a large time scale, such as the one we're dealing with here. Here are the code and the results:

#Kapitsa population growth model
kapitsa_model <- new("odeModel",
        main=function(time,init,parms, ...) {
            x <- init
            p <- parms
            dpop <- p["C"]/((p["Tc"]-time)^2 +p["tau"]^2)
            list(dpop)},
        init=c(pop=2431214),
        parms=c(C=1.86e+11,Tc=2007,tau=42),
        times=seq(from=-10000.0,to=2015.0,by=1.0),
        solver="lsoda")

#cost function
fCost <- function(p) {
    parms(kapitsa_model) <- p
    out <- out(sim(kapitsa_model))
    return(modCost(out, census, weight="none"))
}

#model fitting
result <- modFit(fCost, parms(kapitsa_model),
                 lower=c(C=0,Tc=-10000,tau=0),
                 upper=c(C=Inf,Tc=10000,tau=20000),
                 method="Marq")
(parms(kapitsa_model) <- result$par)
cat("SC : ",ssqOdeModel(NULL,kapitsa_model,census$time,census[2]),"\n")
#identifiability analysis
sF <- sensFun(func = fCost, parms = parms(kapitsa_model))
(colin <- collin(sF))
           C           Tc          tau 
1.718330e+11 2.002740e+03 4.271126e+01 

SC :  0.3774335 

  C Tc tau N collinearity
1 1  1   0 2          1.4
2 1  0   1 2          2.7
3 0  1   1 2          2.4
4 1  1   1 3          5.4


It shouldn't be hard to convince yourself that this model fits the data much better than any of the previous ones. According to the fitted parameter values and the function's equation in (14), the predicted asymptotic human population level is \(\pi C/\tau \approx 12.64\) billion, a much more realistic and feasible value than those obtained from the other fitted models. The sum-of-squares is 0.377, which puts the Kapitsa model in the realm of decently fitting models. Also, note that the collinearity index for the three parameters tells us that this model's parameters are identifiable. It is interesting that with the data at hand, which is surely more than what was available to Kapitsa in the 1990s, we obtain a slightly different estimate for \(T_c\): Kapitsa's was 2007 and ours is roughly 2002. Since \(T_c\) is the time of the inflection point, it is interesting that the additional census data pulled the estimate somewhat closer to the 1960-1970 range. My final comment on this model is a caveat: even though it is by far the best-fitting model, it would be hazardous to mistake model predictions for categorical statements. For example, I cannot categorically state that the maximum human population on earth is 12.64 billion people. Still, this would at least be an informed estimate.
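As a quick check, the asymptotic level quoted above can be recovered directly from the fitted parameters:

#asymptotic population level predicted by the Kapitsa model
with(as.list(parms(kapitsa_model)), pi * C / tau)   #approx. 1.264e+10, i.e. 12.64 billion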

Demographic Transition Theory and some concluding remarks

All of the population growth models covered so far attempt to describe human population dynamics in mathematical terms over various time scales, but they don't address questions of a more qualitative nature, such as why population growth must slow down after a certain point. Demographic transition theory addresses some of these questions. It essentially states that societies going through the process of modernization experience four stages (now possibly five), as laid down in the works of Warren Thompson in 1929 and Frank Notestein in 1945.22

The first stage is the pre-modern regime, in which a high mortality rate and a high fertility rate are in near mutual equilibrium, resulting in very slow population growth. In the second stage, the mortality rate begins to drop rapidly due to improvements in food supply and health, which bring a longer life span and fewer deaths from disease. In the third stage, fertility rates begin to decline. Notestein elaborates on the reasons for this decline, attributing it mainly to socio-economic factors. For example, in an industrialized urban society longer periods of formal schooling are required, which bring about longer periods of economic dependency for children and consequently higher costs of raising them. Lower mortality rates, as another example, increased the size of the family and the number of economically dependent people to support, acting as a further deterrent to having more births. The fertility rate eventually drops to replacement level (estimated as 2.1 children per woman) as the society's population reaches an equilibrium level in the fourth stage. In some countries already well into the fourth stage, such as Japan, fertility rates continue to drop below replacement levels, bringing about a critical economic issue for that nation: an aging population with a dwindling active workforce to support it.

While researching this blog post, I became aware of a very interesting (and frightening) phenomenon occurring today in Japan. It is the case of the so-called hikikomori, a sort of post-modern hermit who shuts himself up in his parents' house and shuns the idea of having a spouse, rearing a family or even having sex. As most of you are probably aware, Japan has been in economic recession for twenty years or more. The traditional prospect of getting a job with a large corporation and remaining loyal to it for the rest of one's productive life entails an insanely competitive educational system, with dwindling job opportunities due to the recession and the automation of labor. Consequently, an increasing number of young people in Japan are opting out of this traditional salaryman system and secluding themselves in their parents' houses. Then there is also the phenomenon of herbivore men, a term coined by Maki Fukasawa to describe men who are not interested in finding girlfriends or wives or pursuing romantic relationships of any sort. Herbivore men usually see themselves as not bound by traditional Japanese values of manliness and show no inclination to hurt others or be hurt by others. Like the hikikomori, they are not salarymen and usually work part-time jobs, which allows them to spend more time on things like manicures, visits to the hairdresser and tending to their poodles, but does not permit them to cover the cost of housing, much less of raising a family. Both of these social phenomena probably have a significant impact on Japan's dire demographic situation.

In the old debate between Condorcet and Malthus, it seems to me that Malthus was wrong about the undesirability of policies benefiting the poorer sectors of society. The Nobel Prize-winning economist Amartya Sen has referred to Malthus' approach as the "override" approach, in which government policies seek to limit or coerce individual freedom in matters such as family planning (think of the Chinese one-child policy, for example). The alternative, which he calls the "collaborative" approach, relies less on legal or economic restrictions and more on the rational decisions of men and women based on their expanded choices. In this sense, Amartya Sen claims that only more development can effectively alleviate the problem of overpopulation, and that public welfare policies, such as improving women's education in particular, can bring about a decline in birth rates:

Central to reducing birth rates, then, is a close connection between women's well-being and their power to make their own decisions and bring about changes in the fertility pattern. Women in many third world countries are deprived by high birth frequency of the freedom to do other things in life, not to mention the medical dangers of repeated pregnancy and high maternal mortality, which are both characteristic of many developing countries. It is thus not surprising that reductions in birth rates have been typically associated with improvement of women's status and their ability to make their voices heard—often the result of expanded opportunities for schooling and political activity.23


Notes

  1. See MALTHUS, p. 9 (Chapter 2).
  2. See the Wikipedia entry on Thomas Malthus.
  3. See QUETELET, p. 276.
  4. This quote pretty much sums up the spirit of nineteenth-century positivism: "It is worth noting that the more the physical sciences have progressed, the more they have tended to enter the domain of mathematics, which is a kind of center toward which they converge. One could even judge the degree of perfection a science has attained by the greater or lesser ease with which it lends itself to calculation." (QUETELET, p. 276). To me, this idea is still in full vigor. With the advent of big data methods, we can expect the social sciences to progress by leaps and bounds.
  5. See QUETELET, p. 277.
  6. See VERHULST, p. 114.
  7. See QUETELET, p. 278.
  8. See the Wikipedia entry on Sigmoid functions.
  9. See VERHULST (1838).
  10. See TSOULARIS, p. 29.
  11. There are other population growth models with this "generalized" denomination, but here we are following the terminology in TSOULARIS (2001), where this five-parameter model is called the generalized population growth model.
  12. The Gompertz model can be derived as a limiting case of the generalized logistic growth function. See TSOULARIS (2001), pp. 34-35.
  13. See PETZOLDT, p. 52.
  14. See PETZOLDT, pp. 52-52.
  15. Cited in GOLOSOVSKY, p. 2.
  16. GOLOSOVSKY, p. 2.
  17. GOLOSOVSKY, pp. 3-4.
  18. Very few people have heard of Norman Borlaug, an American agronomist considered the father of the Green Revolution. Borlaug received the Nobel Peace Prize in 1970 for his work on high-yield varieties of cereals and cultivation methods that is credited with saving billions of people from starvation. This makes Borlaug probably one of the most underrated individuals in history. For an interesting account of this, see Harold Kingsberg's answer to the question "What people in history are underrated?" on Quora (https://www.quora.com/What-people-in-history-are-underrated).
  19. See COHEN, p. 344.
  20. See ORTIZ-OSPINA and ROSNER.
  21. See SOETAERT (2012), p. 15.
  22. See KIRK, p. 361.
  23. See SEN, p. 13.

Bibliography



If you found this post interesting or useful, please share it on Google+, Facebook or Twitter so that others may find it too.