On Solving Large-Scale Finite Minimax Problems
using Exponential Smoothing
E. Y. Pee∗ and J. O. Royset†
This paper focuses on finite minimax problems with many functions, and their solution by means of
exponential smoothing. We conduct run-time complexity and rate-of-convergence analysis of smoothing
algorithms and compare them with those of SQP algorithms. We find that smoothing algorithms may have
only a sublinear rate of convergence, but, as shown by our complexity results, their slow rate of convergence
may be compensated by small computational work per iteration. We present two smoothing algorithms with
active-set strategies that reduce the effect of ill-conditioning using novel precision-parameter adjustment
schemes. Numerical results indicate that the proposed algorithms are competitive with other smoothing and
SQP algorithms, and they are especially efficient for large-scale minimax problems with a significant number
of functions ε-active at stationary points.
Key Words. Finite minimax, exponential penalty function, smoothing techniques, active-set strategy.
1 Introduction
There are many applications that can be expressed as finite minimax problems of the form
\[
(\mathrm{P})\qquad \min_{x\in\mathbb{R}^d}\ \psi(x), \tag{1}
\]
where \(\psi:\mathbb{R}^d\to\mathbb{R}\) is defined by
\[
\psi(x)\ \triangleq\ \max_{j\in Q} f^j(x), \tag{2}
\]
and \(f^j:\mathbb{R}^d\to\mathbb{R}\), \(j\in Q\triangleq\{1,\dots,q\}\), \(q>1\), are smooth functions. Minimax problems of the
form (P) may occur in engineering design [1], control system design [2], portfolio optimization
[3], best polynomial approximation [4], or as subproblems in semi-infinite minimax algorithms
[5]. In this paper, we focus on minimax problems with many functions, which may result
from finely discretized semi-infinite minimax problems or optimal control problems.
∗Graduate Student, Operations Research Department, Naval Postgraduate School, Monterey, California, USA
†Assistant Professor, Operations Research Department, Naval Postgraduate School, Monterey, California, USA
Report Documentation Page (Standard Form 298, Rev. 8-98, OMB No. 0704-0188). Report date: 2010. Title: On Solving Large-Scale Finite Minimax Problems using Exponential Smoothing. Performing organization: Naval Postgraduate School, Operations Research Department, Monterey, CA 93943. Distribution: approved for public release, distribution unlimited. Supplementary notes: in review. Pages: 36.
The non-differentiability of the objective function in (P) poses the main challenge for
solving minimax problems, as the usual gradient methods cannot be applied directly. Many
algorithms have been proposed to solve (P); see for example [6–8] and references therein.
One approach is sequential quadratic programming (SQP), where (P) is first transcribed
into the standard nonlinear constrained problem
\[
(\mathrm{P}')\qquad \min_{(x,z)\in\mathbb{R}^{d+1}}\ \{\, z \mid f^j(x)-z\le 0,\ j\in Q \,\}, \tag{3}
\]
and then an SQP algorithm is applied to solve (P′), advantageously exploiting the special
structure of the transcribed problem; see [7, 9]. Other approaches also based on (P′) include
interior point methods [8, 10, 11] and conjugate gradient methods in conjunction with exact
penalties and smoothing [12].
Each iteration of the SQP algorithm in [7] solves two quadratic programs (QPs) to
compute the main search direction and a modified direction to overcome the Maratos effect.
The SQP algorithm in [7] appears especially promising for problems with many sequentially
related functions, as in the case of finely discretized semi-infinite minimax problems, due
to its aggressive active-set strategy. Recently, an SQP algorithm was proposed in [9], where
the modified direction is obtained by solving a system of linear equations. This reduces the
number of QPs from two to one per iteration, while still retaining global convergence as well
as superlinear rate of convergence. There is no active-set strategy in [9].
In general, an active-set strategy only considers functions that are active or almost
active (ε-active) at the current iterate, and thus greatly reduces the number of function and
gradient evaluations at each iteration of an algorithm. While the number of iterations to
solve a problem to required precision may increase, the overall effect may be a significant
reduction in the total number of function and gradient evaluations of the algorithm. The
numerical results for an active-set minimax algorithm in [13] give a 75% reduction in the
number of gradient evaluations, when compared against the same algorithm without the
active-set strategy. Significant reduction in computing time is also reported for active-set
strategies in [7].
In smoothing algorithms, see for example [6, 12–15], the exponential penalty function
introduced in [16] is used to produce a smooth (twice continuously differentiable) function
that approximates ψ(·). Since the problem remains unconstrained, one can use any standard
unconstrained optimization algorithm to solve the smoothed problem such as the Armijo
Gradient or Newton methods [6] and Quasi-Newton method [13].
A fundamental challenge of smoothing algorithms is that the smoothed problem be-
comes increasingly ill-conditioned as the approximation gets more accurate. Hence, an
unconstrained optimization solver may experience numerical difficulties and slow conver-
gence. Consequently, the use of smoothing techniques is complicated by the need to balance
accuracy of approximation and problem ill-conditioning. An attempt to address these short-
comings was first made in [15], where a precision parameter for the smooth approximation
is initially set to a pre-selected value and is then increased by a fixed factor (specifically 2)
at each consecutive iteration. Effectively, the algorithm is solving a sequence of gradually
more accurate approximations. However, the main problem with this open-loop scheme is
its sensitivity to the selection of the multiplication factor, as can be seen from the numerical
results in [6].
In [6], the authors propose an adaptive precision-parameter adjustment scheme with ex-
ponential smoothing to ensure that the precision parameter is kept small (and thus control-
ling the ill-conditioning) when far from a stationary solution, and is increased as a stationary
solution is approached. The authors use the norm of the smoothed function gradient as a
proxy for the distance to a stationary solution. When the gradient norm of the smoothed
function falls below a user-specified threshold, the precision parameter is increased to a level
that ensures that the gradient norm falls within two user-specified bounds. The numerical
results show that this adaptive scheme manages ill-conditioning much better than open-loop
schemes do. The smoothing algorithms in [15] and [6] do not incorporate
any active-set strategy.
Using the same adaptive precision-parameter adjustment scheme as in [6], the authors
in [13] present a new active-set strategy that can be used in conjunction with exponential
smoothing for specifically tackling large-scale (large q) minimax problems. We note that the
convergence result in Theorem 3.3 of [13] may be slightly incorrect as it claims stationarity
for all accumulation points of a sequence constructed by their algorithm. However, their
proof relies on [6], which guarantees stationarity for only one accumulation point.
While the literature describes several smoothing algorithms for (P), there appears to be
no run-time complexity and rate of convergence analysis of such algorithms. Moreover, we
find no comprehensive empirical comparison of run times for SQP and smoothing algorithms.
In this paper, we present run-time complexity and rate of convergence results for smoothing
algorithms and compare them with those of SQP algorithms. We propose two new active-set
smoothing algorithms based on [6, 13] and present computational test results for large-scale
problem instances.
The next section describes the exponential smoothing technique and its properties. Sec-
tion 3 defines a smoothing algorithm and discusses run-time complexity and rate of conver-
gence. Section 4 presents two new smoothing algorithms and their proofs of convergence.
Section 5 contains numerical test results.
2 Exponential Smoothing
In this section, we describe the exponential smoothing technique, include for completeness
some known results, and show that the technique leads to consistent approximations (see
Section 3.3 of [17]).
For ease of analysis of active-set strategies, we consider the problem
\[
(\mathrm{P}_\Omega)\qquad \min_{x\in\mathbb{R}^d}\ \psi_\Omega(x), \tag{4}
\]
where
\[
\psi_\Omega(x)\ \triangleq\ \max_{j\in\Omega} f^j(x), \tag{5}
\]
and \(\Omega\subset Q\). When \(\Omega=Q\), \((\mathrm{P}_Q)\) is identical to (P). Next, for any \(p>0\) and \(\Omega\subset Q\), we
consider a smooth approximating problem to \((\mathrm{P}_\Omega)\), called the smoothed problem,
\[
(\mathrm{P}_{p\Omega})\qquad \min_{x\in\mathbb{R}^d}\ \psi_{p\Omega}(x), \tag{6}
\]
where
\[
\psi_{p\Omega}(x)\ \triangleq\ \frac{1}{p}\log\Bigl(\sum_{j\in\Omega}\exp\bigl(pf^j(x)\bigr)\Bigr) \tag{7}
\]
\[
\phantom{\psi_{p\Omega}(x)}\ =\ \psi_\Omega(x)+\frac{1}{p}\log\Bigl(\sum_{j\in\Omega}\exp\bigl(p(f^j(x)-\psi_\Omega(x))\bigr)\Bigr) \tag{8}
\]
is the exponential penalty function, with log(·) denoting the natural logarithm. This smoothing technique was first introduced in [16] and later used in [6, 12–15].
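The shifted form (8) is also the numerically stable way to evaluate the exponential penalty function in practice, since \(\exp(pf^j(x))\) overflows quickly for large \(p\). A minimal Python sketch (the function names are ours, for illustration only), which also exercises the approximation-error bounds given in Proposition 2.1(ii) below:

```python
import math

def psi(fvals):
    # psi_Omega(x): pointwise maximum of the function values over Omega
    return max(fvals)

def psi_p(fvals, p):
    # Exponential penalty function in the shifted form (8); subtracting the
    # max before exponentiating avoids overflow for large p.
    m = max(fvals)
    return m + math.log(sum(math.exp(p * (f - m)) for f in fvals)) / p

# The smoothing error 0 <= psi_p - psi <= log|Omega|/p tightens as p grows
fvals = [1.0, 0.7, 0.99]
for p in (1.0, 10.0, 100.0):
    gap = psi_p(fvals, p) - psi(fvals)
    assert 0.0 <= gap <= math.log(len(fvals)) / p + 1e-12
```

Note that the shift by the maximum is exactly the algebraic step that turns (7) into (8), so no accuracy is lost.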
We denote the set of active functions at \(x\in\mathbb{R}^d\) by \(\hat\Omega(x)\triangleq\{j\in\Omega \mid f^j(x)=\psi_\Omega(x)\}\). Except as specifically stated in Appendix A, we denote components of a vector by superscripts. We also let \(\mathbb{N}\) denote the set of positive integers and \(\mathbb{N}_0\triangleq\mathbb{N}\cup\{0\}\).
The parameter p > 0 is the smoothing precision parameter, where a larger p implies
higher precision as formalized by the following proposition; see for example [13].
Proposition 2.1. Suppose that \(\Omega\subset Q\) and \(p>0\).
(i) If the functions \(f^j(\cdot)\), \(j\in\Omega\), are continuous, then \(\psi_{p\Omega}(\cdot)\) is continuous and decreases
monotonically as \(p\) increases.
(ii) For any \(x\in\mathbb{R}^d\),
\[
0 \ \le\ \frac{\log|\hat\Omega(x)|}{p} \ \le\ \psi_{p\Omega}(x)-\psi_\Omega(x) \ \le\ \frac{\log|\Omega|}{p}, \tag{9}
\]
where \(|\cdot|\) denotes the cardinality operator.
(iii) If the functions \(f^j(\cdot)\), \(j\in\Omega\), are continuously differentiable, then \(\psi_{p\Omega}(\cdot)\) is continuously
differentiable, with gradient
\[
\nabla\psi_{p\Omega}(x) = \sum_{j\in\Omega}\mu_p^j(x)\,\nabla f^j(x), \tag{10}
\]
where
\[
\mu_p^j(x) \ \triangleq\ \frac{\exp\bigl(pf^j(x)\bigr)}{\sum_{k\in\Omega}\exp\bigl(pf^k(x)\bigr)}
= \frac{\exp\bigl(p[f^j(x)-\psi_\Omega(x)]\bigr)}{\sum_{k\in\Omega}\exp\bigl(p[f^k(x)-\psi_\Omega(x)]\bigr)} \in (0,1), \tag{11}
\]
and \(\sum_{j\in\Omega}\mu_p^j(x)=1\).
(iv) If the functions \(f^j(\cdot)\), \(j\in\Omega\), are twice continuously differentiable, then \(\psi_{p\Omega}(\cdot)\) is twice
continuously differentiable, with Hessian
\[
\nabla^2\psi_{p\Omega}(x) = \sum_{j\in\Omega}\mu_p^j(x)\,\nabla^2 f^j(x)
+ p\sum_{j\in\Omega}\mu_p^j(x)\,\nabla f^j(x)\nabla f^j(x)^T
- p\Bigl[\sum_{j\in\Omega}\mu_p^j(x)\,\nabla f^j(x)\Bigr]\Bigl[\sum_{j\in\Omega}\mu_p^j(x)\,\nabla f^j(x)\Bigr]^T. \tag{12}
\]
Assumption 2.1. We assume that the functions \(f^j(\cdot)\), \(j\in Q\), are twice continuously differentiable.
The next lemma can be deduced from Lemma 2.2 of [6].
Lemma 2.1. Suppose that Assumption 2.1 holds. Then, for every bounded set \(S\subset\mathbb{R}^d\),
there exists an \(L<\infty\) such that
\[
\langle y,\nabla^2\psi_{p\Omega}(x)y\rangle \le pL\|y\|^2, \tag{13}
\]
for all \(x\in S\), \(y\in\mathbb{R}^d\), \(\Omega\subset Q\), and \(p\ge 1\).
A continuous, nonpositive optimality function for \((\mathrm{P}_\Omega)\) is given by
\[
\theta_\Omega(x) \ \triangleq\ -\min_{\mu\in\Sigma_\Omega}\Bigl\{\sum_{j\in\Omega}\mu^j\bigl(\psi_\Omega(x)-f^j(x)\bigr) + \frac{1}{2}\Bigl\|\sum_{j\in\Omega}\mu^j\nabla f^j(x)\Bigr\|^2\Bigr\}, \tag{14}
\]
where
\[
\Sigma_\Omega \ \triangleq\ \Bigl\{\mu\in\mathbb{R}^{|\Omega|}\ \Big|\ \mu^j\ge 0 \text{ for all } j\in\Omega,\ \sum_{j\in\Omega}\mu^j=1\Bigr\}, \tag{15}
\]
which results in the following optimality condition for \((\mathrm{P}_\Omega)\); see Theorems 2.1.1, 2.1.3, and
2.1.6 of [17].
Proposition 2.2. Suppose that Assumption 2.1 holds and that \(\Omega\subset Q\). If \(x^*\in\mathbb{R}^d\) is a local
minimizer for \((\mathrm{P}_\Omega)\), then \(\theta_\Omega(x^*)=0\).
The continuous, nonpositive optimality function
\[
\theta_{p\Omega}(x) \ \triangleq\ -\frac{1}{2}\|\nabla\psi_{p\Omega}(x)\|^2 \tag{16}
\]
characterizes stationary points of \((\mathrm{P}_{p\Omega})\), as stated in the next proposition; see Proposition
1.1.6 in [17].
Proposition 2.3. Suppose that Assumption 2.1 holds, \(p>0\), and \(\Omega\subset Q\). If \(x^*\in\mathbb{R}^d\) is a
local minimizer for \((\mathrm{P}_{p\Omega})\), then \(\theta_{p\Omega}(x^*)=0\).
We next show that the exponential smoothing technique leads to consistent approximations
(see Section 3.3 in [17]), which ensures that globally and locally optimal points as well
as stationary points of \((\mathrm{P}_{p\Omega})\) converge to corresponding points of \((\mathrm{P}_\Omega)\) as \(p\to\infty\). Consistent
approximations also facilitate the construction of implementable algorithms for (P); see
Algorithm 4.1 below.
We define consistent approximation as on page 399 of [17].
Definition 2.1. For any \(\Omega\subset Q\), \(p>0\), we say that the pair \(((\mathrm{P}_{p\Omega}),\theta_{p\Omega}(\cdot))\) is a consistent
approximation to \(((\mathrm{P}_\Omega),\theta_\Omega(\cdot))\) if (i) \((\mathrm{P}_{p\Omega})\) epi-converges to \((\mathrm{P}_\Omega)\) as \(p\to\infty\), and (ii) for any
sequences \(\{x_i\}_{i=0}^\infty\subset\mathbb{R}^d\) and \(\{p_i\}_{i=0}^\infty\), \(p_i>0\) for all \(i\), and \(x^*\in\mathbb{R}^d\) such that \(x_i\to x^*\) and
\(p_i\to\infty\) as \(i\to\infty\), \(\limsup_{i\to\infty}\theta_{p_i\Omega}(x_i)\le\theta_\Omega(x^*)\).
Theorem 2.1. Suppose that Assumption 2.1 holds, \(p>0\), and \(\Omega\subset Q\). Then, the pair
\(((\mathrm{P}_{p\Omega}),\theta_{p\Omega}(\cdot))\) is a consistent approximation to \(((\mathrm{P}_\Omega),\theta_\Omega(\cdot))\).
Proof. We follow the proofs of Lemmas 4.3 and 4.4 in [18], but simplify the arguments as
[18] deals with min-max-min problems. By Theorem 3.3.2 of [17], Proposition 2.1(ii), and
the continuity of \(\psi_\Omega(\cdot)\), it follows that \((\mathrm{P}_{p\Omega})\) epi-converges to \((\mathrm{P}_\Omega)\) as \(p\to\infty\).
We next consider the optimality functions. Let \(\{x_i\}_{i=0}^\infty\subset\mathbb{R}^d\) and \(\{p_i\}_{i=0}^\infty\), \(p_i>0\) for all
\(i\), be arbitrary sequences and \(x^*\in\mathbb{R}^d\) be such that \(x_i\to x^*\) and \(p_i\to\infty\) as \(i\to\infty\). Since
\(\mu_p^j(x)\in(0,1)\) for any \(j\in\Omega\), \(p>0\), and \(x\in\mathbb{R}^d\), \(\{\mu_{p_i}(x_i)\}_{i=0}^\infty\) is a bounded sequence in
\(\mathbb{R}^{|\Omega|}\) with at least one convergent subsequence. Hence, for every such subsequence \(K\subset\mathbb{N}_0\),
there exists a \(\mu_\infty\in\Sigma_\Omega\) such that \(\mu_{p_i}(x_i)\to^K\mu_\infty\) as \(i\to\infty\). Moreover, since \(\mu_\infty\in\Sigma_\Omega\),
\(\sum_{j\in\Omega}\mu_\infty^j=1\).
If \(j\notin\hat\Omega(x^*)\), then there exist a \(t>0\) and \(i_0\in\mathbb{N}\) such that \(f^j(x_i)-\psi_\Omega(x_i)\le -t\) for all
\(i\ge i_0\). Hence, from (11), \(\mu_{p_i}^j(x_i)\to 0\) as \(i\to\infty\), and therefore \(\mu_\infty^j=0\). By continuity of
\(\nabla f^j(\cdot)\), \(j\in\Omega\),
\[
\theta_{p_i\Omega}(x_i) \ \to^K\ -\frac{1}{2}\Bigl\|\sum_{j\in\Omega}\mu_\infty^j\nabla f^j(x^*)\Bigr\|^2 \ \triangleq\ \theta_{\infty\Omega}(x^*), \tag{17}
\]
as \(i\to\infty\). Since \(\mu_\infty\in\Sigma_\Omega\) and \(\mu_\infty^j=0\) for all \(j\notin\hat\Omega(x^*)\), we find in view of (14) that
\[
\theta_{\infty\Omega}(x^*) = -\sum_{j\in\Omega}\mu_\infty^j\bigl(\psi_\Omega(x^*)-f^j(x^*)\bigr) - \frac{1}{2}\Bigl\|\sum_{j\in\Omega}\mu_\infty^j\nabla f^j(x^*)\Bigr\|^2 \ \le\ \theta_\Omega(x^*). \tag{18}
\]
This completes the proof.
3 Run-Time Complexity and Rate of Convergence
In this section, we focus on the run-time complexity and rate of convergence of smoothing
algorithms. Specifically, we deal with the following simple smoothing algorithm for solving
(P), based on application of the Armijo Gradient Method¹ to \((\mathrm{P}_{pQ})\).
¹The Armijo Gradient Method uses the steepest descent search direction and the Armijo step size rule to solve an
unconstrained problem; see for example Algorithm 1.3.3 of [17].
Algorithm 3.1. Smoothing Armijo Gradient Algorithm
Data: \(t>0\), \(x_0\in\mathbb{R}^d\).
Parameter: \(\delta\in(0,1)\).
Step 1. Set \(p^* = \log q/((1-\delta)t)\).
Step 2. Generate a sequence \(\{x_i\}_{i=0}^\infty\) by applying the Armijo Gradient Method to \((\mathrm{P}_{p^*Q})\).
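Algorithm 3.1 can be sketched in a few lines of Python. This is a minimal rendering under our own naming conventions: the iteration cap, tolerances, and the Armijo parameters α, β are illustrative choices of ours, not values prescribed by the paper.

```python
import math

def smoothing_armijo(f_list, grad_list, x0, t, delta=0.5,
                     alpha=0.5, beta=0.8, max_iter=500):
    # Step 1: fix the precision parameter p* = log q / ((1 - delta) t), so that
    # solving the smoothed problem to within delta*t solves (P) to within t.
    q = len(f_list)
    p = math.log(q) / ((1.0 - delta) * t)

    def psi_p(x):
        # smoothed objective, evaluated in the stable shifted form (8)
        fv = [f(x) for f in f_list]
        m = max(fv)
        return m + math.log(sum(math.exp(p * (fj - m)) for fj in fv)) / p

    def grad_psi_p(x):
        # gradient (10) with softmax weights (11)
        fv = [f(x) for f in f_list]
        m = max(fv)
        w = [math.exp(p * (fj - m)) for fj in fv]
        s = sum(w)
        gv = [g(x) for g in grad_list]
        return [sum(w[j] / s * gv[j][k] for j in range(q))
                for k in range(len(x))]

    # Step 2: Armijo gradient iterations on the smoothed problem
    x = list(x0)
    for _ in range(max_iter):
        g = grad_psi_p(x)
        gnorm2 = sum(gk * gk for gk in g)
        if gnorm2 < 1e-18:
            break
        fx = psi_p(x)
        step = 1.0
        # backtrack until the Armijo sufficient-decrease test holds
        while step > 1e-12 and \
                psi_p([xk - step * gk for xk, gk in zip(x, g)]) > fx - alpha * step * gnorm2:
            step *= beta
        x = [xk - step * gk for xk, gk in zip(x, g)]
    return x
```

For instance, for \(\psi(x)=\max\{x^2,(x-2)^2\}\) with \(t=0.1\) and \(\delta=0.5\), Step 1 gives \(p^*=\log 2/0.05\approx 13.9\), and the iterates approach the minimizer \(x=1\).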
We denote the optimal value of (P) (when it exists) by \(\psi^*\), the optimal value of \((\mathrm{P}_{pQ})\)
(when it exists) by \(\psi_{pQ}^*\) for any \(p>0\), and the optimal solution of \((\mathrm{P}_{pQ})\) (when it exists) by
\(x_{pQ}^*\). Algorithm 3.1 has the following property.
Proposition 3.1. Suppose that Step 2 of Algorithm 3.1 has generated a point \(x_i\in\mathbb{R}^d\) such
that \(\psi_{p^*Q}(x_i)-\psi_{p^*Q}^*\le\delta t\). Then, \(\psi(x_i)-\psi^*\le t\).
Proof. The result follows directly from (9) and the selection of \(p^*\).
Proposition 3.1 shows that we can obtain a near-optimal solution of (P) by approximately
solving \((\mathrm{P}_{pQ})\) for a sufficiently large \(p\). As discussed in Section 1, Algorithm 3.1 will
be prone to ill-conditioning. Adaptive schemes for adjusting the precision parameter \(p\) and
the use of another method in Step 2 may perform better in practice. However, the following
study of run-time complexity and rate of convergence of Algorithm 3.1 provides fundamental
insights into smoothing algorithms in general.
We start with some intermediate results that utilize the following convexity assumption.
Assumption 3.1. Suppose that \(f^j(\cdot)\), \(j\in Q\), are twice continuously differentiable and there
exist \(0<m\le M<\infty\) such that
\[
m\|y\|^2 \le \langle y,\nabla^2 f^j(x)y\rangle \le M\|y\|^2, \tag{19}
\]
for all \(x,y\in\mathbb{R}^d\) and all \(j\in Q\).
Lemma 3.1. Suppose that Assumption 3.1 holds. For any \(x,y\in\mathbb{R}^d\) and \(p>0\),
\[
m\|y\|^2 \le \langle y,\nabla^2\psi_{pQ}(x)y\rangle. \tag{20}
\]
Proof. From (12) and (19), we obtain
\[
\begin{aligned}
\langle y,\nabla^2\psi_{pQ}(x)y\rangle &= \sum_{j\in Q}\mu_p^j(x)\langle y,\nabla^2 f^j(x)y\rangle + p\sum_{j\in Q}\mu_p^j(x)\langle y,\nabla f^j(x)\nabla f^j(x)^T y\rangle\\
&\quad - p\Bigl\langle y,\Bigl[\sum_{j\in Q}\mu_p^j(x)\nabla f^j(x)\Bigr]\Bigl[\sum_{j\in Q}\mu_p^j(x)\nabla f^j(x)\Bigr]^T y\Bigr\rangle\\
&= \sum_{j\in Q}\mu_p^j(x)\langle y,\nabla^2 f^j(x)y\rangle + p\sum_{j\in Q}\mu_p^j(x)\langle y,\nabla f^j(x)\rangle^2 - p\Bigl\langle y,\sum_{j\in Q}\mu_p^j(x)\nabla f^j(x)\Bigr\rangle^2\\
&\ge m\|y\|^2 + p\sum_{j\in Q}\mu_p^j(x)\langle y,\nabla f^j(x)\rangle^2 - p\Bigl\langle y,\sum_{j\in Q}\mu_p^j(x)\nabla f^j(x)\Bigr\rangle^2.
\end{aligned}
\]
Hence, we only need to show that the difference of the last two terms is nonnegative. Let
\(g:\mathbb{R}^d\to\mathbb{R}\) be defined by \(g(z)=\langle y,z\rangle^2\). The function \(g\) is a composition of a convex function
with a linear function, so it is convex; see for example Proposition 2.1.5 of [19]. Hence, it
follows from Jensen's inequality (see for example page 6 of [19]) that
\[
\sum_{j\in Q}\mu_p^j(x)\,g\bigl(\nabla f^j(x)\bigr) \ \ge\ g\Bigl(\sum_{j\in Q}\mu_p^j(x)\nabla f^j(x)\Bigr). \tag{21}
\]
Since \(p>0\), the result follows.
Proposition 3.2. Suppose that Assumption 3.1 holds and \(p\ge 1\). Then, the rate of convergence
of the Armijo Gradient Method applied to \((\mathrm{P}_{pQ})\) is linear with coefficient \(1-k/p\), for
some \(k\in(0,1)\). That is, for any sequence \(\{x_i\}_{i=0}^\infty\subset\mathbb{R}^d\) generated by the Armijo Gradient
Method when applied to \((\mathrm{P}_{pQ})\),
\[
\psi_{pQ}(x_{i+1})-\psi_{pQ}^* \ \le\ \Bigl(1-\frac{k}{p}\Bigr)\bigl[\psi_{pQ}(x_i)-\psi_{pQ}^*\bigr] \quad\text{for all } i\in\mathbb{N}_0. \tag{22}
\]
Proof. Based on Lemmas 2.1 and 3.1, for every bounded set \(S\subset\mathbb{R}^d\) there exists an \(L\in[m,\infty)\) such that
\[
m\|y\|^2 \le \langle y,\nabla^2\psi_{pQ}(x)y\rangle \le pL\|y\|^2, \tag{23}
\]
for all \(x\in S\), \(y\in\mathbb{R}^d\), and \(p\ge 1\). Hence, we deduce from Theorem 1.3.7 of [17] that the
rate of convergence of the Armijo Gradient Method applied to \((\mathrm{P}_{pQ})\) is
\[
1-\frac{4m\beta\alpha(1-\alpha)}{pL} \ \in\ (0,1), \tag{24}
\]
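The coefficient (24) makes the accuracy/conditioning trade-off concrete: as \(p\) grows (higher smoothing accuracy), the contraction factor approaches 1 and progress per iteration vanishes. A quick numerical illustration; the specific constants \(m\), \(L\), \(\alpha\), \(\beta\) below are representative values chosen by us, not taken from the paper:

```python
def armijo_rate_coeff(p, m=1.0, L=1.0, alpha=0.5, beta=0.8):
    # Linear-rate coefficient (24) for the Armijo Gradient Method on (P_pQ)
    return 1.0 - 4.0 * m * beta * alpha * (1.0 - alpha) / (p * L)

# The coefficient climbs toward 1 as the precision parameter p increases,
# which is the source of the sublinear overall rate discussed in the abstract.
coeffs = [armijo_rate_coeff(p) for p in (1.0, 10.0, 100.0)]
```

With these constants the coefficient rises from 0.2 at \(p=1\) to 0.992 at \(p=100\), so each tenfold increase in precision costs roughly a tenfold slowdown in the linear rate.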