Extrapolating by Inverse Optimization

2019/03/09

Wherein I present a simple example of a prediction problem modeled as inverse optimization and its formulation as a bilevel optimization problem, with a brief detour into Bayesian estimation.

Our story begins with Alice who runs a grocery store with a small selection of locally grown produce. Alice wishes to estimate the impacts of changing the prices of her produce on the purchase behavior of her customers and wonders if the data she has been collecting over the previous year can help. After consulting with her data scientist friend she puts together a database of quantities of produce purchased by individual customers under various pricing regimes, and hands him the trove.

The collected data looks something like the following. (In each row the prices and quantities are for all the products Alice sells.)

Date    Customer.ID    Price.Vector              Quantity.Vector
Jan 1   XYZ            [$3.2, $2.5, $1.0, …]     [1.0 lb, 0.0 lb, 0.0 lb, …]
Jan 8   ABC            [$2.0, $2.0, $1.5, …]     [0.0 lb, 0.0 lb, 0.5 lb, …]
Feb 4   XYZ            [$2.5, $2.0, $1.8, …]     [1.0 lb, 0.0 lb, 0.1 lb, …]

The data scientist friend doesn’t see what the big problem is. He builds deep learning models in his sleep and this seems like a simple (multi-output) regression problem. He could just build a machine to predict the quantities column from the price column and take the afternoon off.

Alas, it was not to be: the price vectors that Alice wants to plug into the trained regression model lie well outside the ranges of those she provided the data scientist for training. She sees no point in any such modeling if it doesn’t enable extrapolation. She is appalled by the resulting nonsensical predictions, which her friend assures her were produced by algorithms indistinguishable from the ones that in the near future will drive her around town.

A Physics of Purchase Behavior

Alice decides to take a less shallow approach and write out what she knows about the problem and how it relates to what she wishes to estimate.

She reasons that each of her customers \(\smash{i}\) has a weekly budget \(\smash{b_{i}}\) for produce and a personal produce value vector \(\smash{v_i}\). (She makes the simplifying assumption that the budget does not vary from week to week.) Each customer may have other individual constraints on the purchases (e.g., at least \(2\) lb of fruit, no more than \(5\) lb of perishables).

She hypothesizes that a customer on every weekly visit looks at the produce price vector \(\smash{p_t}\) (where \(t\) stands for the date of the store visit), and mentally runs a simple optimization of the following form to decide what quantities of each item of produce (\(x_{it}\)) to purchase:

\[\begin{aligned} x_{it} &= \mbox{arg max}_x \; v_i^T x \\ \mbox{subject to} &\;\; \smash{p_t^T} x \leq b_{i} \\ &\; \; A x \leq d_i \end{aligned} \]

The last constraint captures the notion of the customer-specific constraints mentioned above, where \(A\) is assumed known (since its rows define some common purchase constraints) while the constraint bounds \(d_i\) are not known.

The above optimization problem with a linear objective and constraints is called a Linear Program for which there exist very efficient solvers. Look below for a basic implementation of a linear program in R.

That is, if for any customer Alice knew the parameters of their optimization problem (\(v_i\) up to a normalizing scalar, \(b_i\), and \(d_i\)), then predicting the purchase behavior is easy: it just involves solving the above problem for any price vector of interest \(p\) to obtain a prediction of that customer’s purchase quantity vector. Note that, albeit esoteric, the optimizer defines a family of regression functions \(x_{it} = Opt(p_t; \theta_i)\) parameterized by \(\theta_i \triangleq (v_i, b_i, d_i)\), in principle no different from any other family.

Therefore the problem of learning the regressor reduces to estimating these parameters from historical prices and their associated quantity vectors. This estimation problem, because it is posed as inverting the optimization procedure, is studied under the heading Inverse Optimization.

(The term engineers use for such an inverting process, that is, estimating the parameters of a system from observations of behavior, is System Identification. Although many of the techniques that engineers use exploit some special structure of their systems, their task is exactly the same as Alice’s.)

Bayesianism as Commonsense

Alice first codes up the \(Opt(.)\) function in the hope that when she comes up with a way to estimate the parameters, she will be ready with the prediction function. She quickly realizes that a simple approach to estimating the parameters is the following. First select some sensible ranges for the various parameters (knowing the neighborhood well, she has a good sense of what reasonable budgets, value vectors, etc. look like), sample a vector of parameters within those ranges, and run the function \(Opt(.)\) over the prices in her dataset. If the resulting predictions of the quantities are close to those in the data, keep the parameter vector, else discard it. We can view this procedure as sampling, followed by running the \(Opt(.)\) regression function (which I will call simulation, because it simulates the purchase behavior), followed by filtering out the samples that do not yield data close to the historical record (the observations).

After running this sample-simulate-filter process some large number of times she will have a set of parameter vectors which produced data that looks like her database, that is, a set of parameter vectors compatible with her data. If this set ends up being clustered around some value (i.e., if very different parameter settings do not end up producing data close to the observations), then she could choose the mean of the filtered parameter set as her estimate. Problem solved.
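Here is a minimal sketch of the sample-simulate-filter loop, assuming the \(Opt(.)\) implementation from the Miscellany section below (two products). The parameter ranges, the number of samples, and the closeness tolerance are all illustrative assumptions; prices is a list of historical price vectors and quantities the matching \(2 \times T\) matrix of observed purchases.

sample_simulate_filter <- function(prices,      #list of historical price vectors
                                   quantities,  #2 x T matrix of observed purchases
                                   n_samples = 10000,
                                   tol = 0.5)   #closeness tolerance (an assumption)
{
  kept <- list()
  for (iter in seq_len(n_samples)) {
    #sample: draw a parameter vector from sensible (assumed) ranges
    v <- runif(2); v <- v / sum(v)              #value vector, known only up to scale
    b <- runif(1, 5, 20)                        #weekly budget
    d <- c(runif(1, 2, 8), runif(1, 0.5, 3))    #bounds for the "other" constraints
    #simulate: run the Opt() regression function over the historical prices
    sims <- lapply(prices, function(p) Opt(p, v = v, b = b, d = d))
    if (any(sapply(sims, `[[`, "status") != "optimal")) next  #discard degenerate draws
    preds <- sapply(sims, function(sm) as.vector(sm$quantities))  #2 x T predictions
    #filter: keep only parameters whose predictions are close to the record
    if (sum((preds - quantities)^2) < tol)
      kept[[length(kept) + 1]] <- c(v, b, d)
  }
  do.call(rbind, kept)                          #one row per compatible parameter vector
}

Column means of the returned matrix give the point estimate described above.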

In fact, she reasons that instead of taking the mean of the filtered parameter vectors, she could use them to estimate her uncertainty in the predictions from her regression function. For a price vector that she would like to plug into her regression function, she could run the \(Opt(.)\) function over a random sample of compatible parameter vectors and use the range of the predicted values as her measure of uncertainty.
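Concretely, with the compatible parameter vectors from the sketch above (one per row of kept), a band of predictions at a hypothetical new price vector might look like this:

p_new <- c(4.0, 1.5)                 #a price regime outside the training ranges
preds <- apply(kept, 1, function(th)
  as.vector(Opt(p_new, v = th[1:2], b = th[3], d = th[4:5])$quantities))
apply(preds, 1, quantile, probs = c(0.05, 0.5, 0.95))  #per-product prediction bands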

Although Alice is unaware of it, she has not only independently derived the fundaments of Bayesian reasoning from commonsense, but also a particular computational approach to Bayesian estimation. An entire subfield of Bayesian statistics, with journals, yearly conferences and other trimmings, called Approximate Bayesian Computation (ABC), is dedicated to Alice’s sample-simulate-filter method, with the goal of making it computationally feasible. Note how the ABC approach can be applied to invert any parameterized simulator, at least in principle. This generality is paid for in computational costs.

A Quick Aside on Identifiability. Identifiability is a topic that statisticians like to obsess over in their work, resulting in a thicket of abstruse procedures for practitioners to follow to avoid a dull feeling of being insufficiently diligent.

From a Bayesian vantage the commonsense way to think about identifiability is captured in a statement made above – very different parameter settings do not end up producing data close to the observations. That is, if vastly different (a priori equally likely) parameter values are compatible with the data, then we say that the parameters of the model are non-identifiable. In fact, in practice the requirement is even a bit milder than that. Loosely speaking, for the kinds of extrapolations we wish to perform with the learned model (i.e., simulator + compatible parameter vectors), if the predictions on any input from most of the compatible parameter vectors are similar, then for practical purposes the model is identified.

Identifiability in a Bayesian setting is a function of the prior, the simulator, and the observed data. If, by some external considerations, we can disallow parts of the parameter space that end up being compatible with the data, or if we are able to collect more of the right kind of data, we can recover identifiability. If neither is possible, the only hope is to restrict the simulator by reparameterization and gain only partial knowledge about the system being studied, consequently allowing only a restricted type of extrapolation.
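Alice’s model contains a simple instance of this: the value vector enters \(Opt(.)\) only through the direction of the objective, so \(v_i\) and any positive multiple of it are compatible with exactly the same data (which is why \(v_i\) can at best be estimated up to a normalizing scalar). A quick check using the implementation from the Miscellany section:

Opt(p = c(3, 2), v = c(0.1, 0.9))$quantities  #predicted purchases
Opt(p = c(3, 2), v = c(0.2, 1.8))$quantities  #doubled value vector, same predictions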

Saving on AWS Costs

As I alluded to above, the ABC procedure can be very computationally demanding when the amount of observed data or the number of parameters is large. Consequently, for models (priors + simulators) about which we know a little bit more, statisticians and engineers have come up with a host of techniques to speed up the estimation process. For example, if the simulation function is such that we can compute the likelihood, that is, the probability of the observations for a given parameter vector, techniques like Markov Chain Monte Carlo (MCMC) or Variational Bayes can exploit that fact for a speedy estimation of the model posterior (which is a specification of the parameter vectors compatible with the observed data). I won’t go any more into these techniques here than pointing to my presentation on variational inference.

Finally Getting to the Point

It turns out that for the particular model that Alice constructed by representing the purchase quantities as solutions of linear programs, the parameters of the model can be estimated by exploiting optimality properties of linear programs.

For simplicity let us assume that the measure of distance of the predictions from the simulator to the observed data is the sum of squared errors. We can now restate Alice’s parameter estimation problem as

\[\begin{aligned} \hat{\theta}_i, \hat{x} &= \mbox{arg min}_{\theta_i, x} \;\sum_t{(x_{it} - q_{it})^2} \\ \mbox{subject to} &\;\; x_{it} = Opt(p_t; \theta_i), \;\; \forall t \end{aligned} \] where \(\hat{\theta}_i \triangleq (\hat{v}_i, \hat{b}_i, \hat{d}_i)\), \(p_t\) is the price vector for week \(t\), and \(q_{it}\) is the quantity vector purchased by customer \(i\) that week. Note how even though we only care about an estimate for \(\theta_i\) the optimization problem is over \(\theta_i\) as well as \(x\) (the vector of predictions of the quantities from the model).

Again, the above optimization problem is to choose the parameter vector \(\theta_i\) (and the predictions \(x\) from that parameter vector) that minimizes the squared error between the predictions \(x_{it}\) of the simulator and the observed quantities \(q_{it}\). The solution of this problem will yield the estimate of the parameter vector for customer \(i\).

(As in the discussion above, if we have prior knowledge about reasonable values for the parameters or about how the parameters vary between customers, we can use this knowledge by adding cost terms to the objective. This way of adding cost terms to the objective to represent prior knowledge, and then minimizing the overall cost to yield a point estimate of the parameters, is called Maximum A Posteriori (MAP) estimation in Bayesian lingo.)

We notice that the above inverse optimization problem has a nested structure. The inner constraint is based on \(Opt(.)\), which is itself an optimization problem (a linear program). The problem is consequently a special case of what is called a bilevel optimization problem.

Bilevel optimization problems have the structure of an outer optimization problem subject to constraints derived from an inner optimization problem. General bilevel optimization problems are difficult to solve, but certain special cases (which include Alice’s problem) can be reformulated as Mathematical Programs with Equilibrium Constraints (MPEC), where certain optimality conditions of the inner optimization problem are exploited to replace the inner optimization-based constraints.

Let us make all this a little clearer with Alice’s example. Again her \(Opt(.)\) problem is written as \[\begin{aligned} x_{it} &= \mbox{arg max}_x \; v_i^T x \\ \mbox{subject to} &\;\; \smash{p_t^T} x \leq b_{i} \\ &\; \; A x \leq d_i \end{aligned} \]

For \(x_{it}\) to be the optimal solution of the above problem it must necessarily satisfy the Karush-Kuhn-Tucker (KKT) conditions, which for Alice’s problem state that there must exist a scalar \(\lambda\) and a vector \(\mu\) (with as many entries as there are rows in \(A\)) that together with \(x_{it}\) satisfy the following constraints.

\[\begin{aligned} v_i - \lambda\, p_t - A^T \mu &= 0 \\ \smash{p_t^T} x_{it} &\leq b_{i} \\ A x_{it} &\leq d_i \\ \lambda &\geq 0 \\ \mu &\geq 0 \\ \lambda (p_t^T x_{it} - b_i ) &= 0 \\ \mu^T (A x_{it} - d_i) &= 0 \end{aligned} \]

The quantity \(\lambda\) and the entries of \(\mu\) are called the Lagrange multipliers in optimization theory, and there is one for every constraint in the problem. Note how the KKT constraints are all linear in \(x_{it}\), \(\lambda\) and \(\mu\) except for the last two, which are called the complementarity constraints. (Complementary because if \(\lambda\) is non-zero then \(p_t^T x_{it} - b_i\) has to be zero, and vice versa.) They encode the notion that if at optimality a certain constraint is inactive (i.e., the constraint might as well not have been part of the original problem – it does not restrict the optimal solution), then the corresponding Lagrange multiplier has to be zero.

The first condition is called the stationarity condition, the second and third are called primal feasibility conditions (these are just a copy of the constraints from the original problem), and the fourth and the fifth are called dual feasibility conditions (the Lagrange multipliers for inequality constraints cannot be negative).

For a linear program the KKT conditions are not only necessary but also sufficient for optimality. That is, if we find a primal-feasible \(x_{it}\) and dual-feasible \(\lambda\) and \(\mu\) which also satisfy the stationarity and complementarity conditions, then \(x_{it}\) is an optimal solution of the original optimization problem.
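We can verify these conditions numerically on the small example from the Miscellany section, since CVXR exposes the Lagrange multipliers of a solved problem as dual values. (A sketch, using the same assumed parameter values as the code below.)

library(CVXR)

p <- c(3, 2); v <- c(0.1, 0.9); b <- 10       #assumed example parameters
A <- rbind(c(1, 1), c(0, 1)); d <- c(5, 1)

x <- Variable(2)
constr_budget <- sum(p * x) <= b
constr_other <- A %*% x <= d
sol <- solve(Problem(Maximize(sum(v * x)), list(constr_budget, constr_other)))

x_star <- sol$getValue(x)
lambda <- as.numeric(sol$getDualValue(constr_budget))  #multiplier for the budget
mu <- as.vector(sol$getDualValue(constr_other))        #multipliers for Ax <= d

v - lambda * p - as.vector(t(A) %*% mu)   #stationarity: should be ~0
lambda * (sum(p * x_star) - b)            #complementarity: should be ~0
as.numeric(t(mu) %*% (A %*% x_star - d))  #complementarity: should be ~0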

We can use this fact to rewrite the bilevel optimization problem. That is, we can translate the problem

\[\begin{aligned} \hat{\theta}_i, \hat{x} &= \mbox{arg min}_{\theta_i, x} \;\sum_t{(x_{it} - q_{it})^2} \\ \mbox{subject to} &\;\; x_{it} = Opt(p_t; \theta_i), \;\; \forall t \end{aligned} \] to an equivalent problem

\[\begin{aligned} \hat{\theta}_i, \hat{x} &= \mbox{arg min}_{\theta_i, x, \lambda, \mu} \;\sum_t{(x_{it} - q_{it})^2} \\ \mbox{subject to, for each week } t: &\;\; \\ & v_i - \lambda_t\, p_t - A^T \mu_t = 0 \\ & \smash{p_t^T} x_{it} \leq b_{i} \\ & A x_{it} \leq d_i \\ & \lambda_t \geq 0 \\ & \mu_t \geq 0 \\ & \lambda_t (p_t^T x_{it} - b_i ) = 0 \\ & \mu_t^T (A x_{it} - d_i) = 0 \end{aligned} \]

which is no longer a bilevel optimization problem but, because of the complementarity constraints, a mathematical program with equilibrium constraints (MPEC). Many techniques have been proposed to handle these constraints, but conceptually the most straightforward is to rewrite them using boolean variables, thereby converting the above problem into a Mixed Integer Quadratic Program (MIQP). See the Miscellany section below for more details on this conversion.

Discussion

Assume that Alice was successful in accurately estimating the parameters of her model \(Opt(.)\) and now wants to maximize her own profit by adjusting her price vector \(p\). One should be able to convince oneself that this problem is also a bilevel optimization problem. In fact, the interaction of her prices and the customer purchase behavior is an instance of a Stackelberg game, which models economic scenarios with such a leader-follower dynamic. Bilevel optimization problems pop up in the study of equilibria in Stackelberg games, and have many other economic applications. While the price of fruit is important in its own right, my own interest in this topic is due to its applicability to designing optimal energy portfolios.

Before I conclude, I want to discuss the connection between Inverse Optimization and Inverse Reinforcement Learning (IRL), which at least superficially seem like they ought to be related. IRL deals with the estimation of the reward function that an agent is optimizing by observing its behavior in an environment. This can be useful, for example, when we want to train a robot to imitate a human whose behavior we can observe but not the reward function she is optimizing.

To formulate Alice’s problem as IRL, we model her customers as agents who find themselves in a state, which is the price regime set by Alice, and take an action, which is the quantities they purchase. In order to train RL agents that behave like the customers, we would need the reward function that maps a state and action to a reward value, i.e., a function that maps a price and quantity vector to a reward. The reward function for the \(i^{th}\) customer in Alice’s model can be written as (where \(s\) is the state and \(a\) is the action)

\[\begin{aligned} r(s, a) &= \left\{ \begin{array}{ll} v_i^T a & \mbox{if } s^T a \leq b_i \mbox{ and } A a \leq d_i \\ -\infty & \mbox{otherwise} \end{array} \right. \end{aligned} \]

IRL attempts to approximate this reward function (without assuming the functional form above) from historical data (that is, the recorded behavior of customers), which can then be used to reinforcement-learn a simulator for customer behavior – essentially to reverse engineer the \(Opt(.)\) function from data. In other words, the IRL way of solving Alice’s problem lies somewhere in between the regression approach and the inverse optimization approach, assuming a bit more than the former and a bit less than the latter.

Miscellany

Linear programming in R. Example implementation of Alice’s \(Opt(.)\) function.

library(CVXR)

Opt <- function(p,                  #input price vector
                v = c(0.1, 0.9),    #value vector
                b = 10,             #budget
                d = c(5, 1))        #rhs of constraints
{
  x <- Variable(2)                    #only two products 

  value <- sum(v * x)                 #total value that the customer optimizes
  
  constr_budget <- sum(p * x) <= b    #total cost no more than budget
  A <- rbind(c(1, 1), c(0, 1))        #"other" constraint matrix
  constr_other <- A %*% x <= d        #other constraints
  
  #solve for value maximization
  solution <- solve(Problem(Maximize(value), list(constr_budget, constr_other))) 
  return(list(status = solution$status, quantities = solution$getValue(x)))
}

sol <- Opt(p = c(3, 2))              #predict quantities of the two products

cat(paste0('Status: ', sol$status))
## Status: optimal
cat(paste0(round(sol$quantities, 2))) #print optimal quantity vector
## 2.67 1

A mixed-integer formulation for OR constraints (so-called “disjunctions”). Complementarity constraints restrict a variable to be zero whenever another is non-zero. More generally, in an optimization problem over a vector \(x\), we might want to add the constraint that at least one of two linear functions \(f(x)\) and \(g(x)\) must be zero. That is, either \(f(x) = 0\) or \(g(x) = 0\), written as \(f(x)\, g(x) = 0\) (a so-called bilinear constraint).

This non-linear constraint can be “linearized” by adding another variable \(z\) to the optimization problem and replacing the bilinear constraint with the following constraints

\[\begin{aligned} -Mz \leq f(x) &\leq M z \\ -M(1-z) \leq g(x) &\leq M (1-z) \\ z &\in \{0, 1\} \end{aligned} \]

where \(M\) is some large positive number. When the binary variable \(z\) takes the value \(0\), the constraints restrict \(f(x)\) to be identically zero and leave \(g(x)\) effectively unconstrained (provided \(M\) is large enough). When it takes the value \(1\), the reverse becomes true.

This trick for converting disjunctive constraints (unions of sets) into intersections is called the big-M reformulation.
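As a toy illustration in CVXR, take the hypothetical disjunction \(x_1 = 0\) or \(x_2 = 0\) (so \(f(x) = x_1\) and \(g(x) = x_2\)); the quadratic objective, the value of \(M\), and the choice of solver are arbitrary assumptions for this example.

library(CVXR)

x <- Variable(2)
z <- Variable(1, boolean = TRUE)       #the binary switch
M <- 100                               #must exceed any plausible |f(x)|, |g(x)|

constr_f <- list(-M * z <= x[1], x[1] <= M * z)              #z = 0 forces x1 = 0
constr_g <- list(-M * (1 - z) <= x[2], x[2] <= M * (1 - z))  #z = 1 forces x2 = 0

#minimize distance to (3, 2) subject to "x1 = 0 OR x2 = 0"
prob <- Problem(Minimize(sum_squares(x - c(3, 2))), c(constr_f, constr_g))
sol <- solve(prob, solver = "ECOS_BB") #a mixed-integer-capable solver
round(sol$getValue(x), 2)              #expect c(3, 0): zeroing x2 is cheaper

Zeroing out \(x_2\) costs \((2-0)^2 = 4\) versus \((3-0)^2 = 9\) for zeroing out \(x_1\), so the solver sets \(z = 1\).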

References

  1. Andrew Gelman, How to think about “identifiability” in Bayesian inference?
  2. Ahuja and Orlin, Inverse Optimization, Part 1: Linear Programming and General Problem. This paper talks about finding the cost vector \(v\) of a linear programming problem given the solution. Under their particular formulation this inverse problem itself turns out to be a linear program.
  3. Dong and Chen, Generalized Inverse Optimization through Online Learning. This paper presents an algorithm to perform online updates to track changing parameters of an optimization problem when observations of inputs and decisions are revealed one at a time.