Srinivasan Balan, North Carolina State University
With the advancement of technology and computing capabilities, computer science and optimization have come a long way in solving complex, large-scale real-world problems. How are these challenging problems solved? Algorithms. Algorithms systematically search the solution space for the desired goal. Many different algorithms could be used to search the solution space, but it is critical that we use the 'best' algorithm in this fast-paced, outcome-based world. Deciding on the 'best' algorithm depends on key performance indicators (KPIs) like speed, efficiency, run time, and space complexity. KPIs change based on the type of problem and the objective of the algorithm. Some methods demand a quick solution of reasonable quality, while others search for optimal solutions like a global maximum or minimum.
This article was inspired by an analytics webinar series conducted by Opex Analytics titled "Five algorithms that changed the world", presented by Dr. Larry Snyder. I borrowed excerpts from that webinar and added my own flavor to present the top five algorithms that have changed the world of problem-solving, optimization, and operations research.
Our first algorithm is Gradient Descent (GD), proposed by Cauchy in 1847 to find the minimum of a function. Consider a convex function shaped like a bowl (Figure 1). If the objective is to minimize the convex function, then the GD algorithm guarantees a global optimum. The idea is to start the search from a point on the surface and descend to the lowest point of the feasible, bowl-shaped region. The pseudo-code of GD is shown below in Algorithm 1. GD starts with an initial guess of the minimum point, X0. The traversing direction is given by the negative gradient, −∇f(Xn), while the step length, or the incremental distance which moves the current point from X0 to X1, is determined by the learning rate αn. This process of choosing a direction and taking a step in that direction continues until the gradient becomes zero or the change in the function value between subsequent iterations falls below a desired epsilon (e.g., 10^−8).
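To make the update rule concrete, here is a minimal sketch of Algorithm 1 in Python; the fixed learning rate, the tolerance, and the example bowl function f(x, y) = x^2 + y^2 are illustrative assumptions, not part of the original pseudo-code.

import numpy as np

def gradient_descent(f, grad_f, x0, alpha=0.1, eps=1e-8, max_iter=10000):
    # Repeatedly step against the gradient until f barely changes.
    x = np.asarray(x0, dtype=float)
    for _ in range(max_iter):
        x_new = x - alpha * grad_f(x)      # move along the negative gradient
        if abs(f(x_new) - f(x)) < eps:     # epsilon stopping rule
            return x_new
        x = x_new
    return x

# Example: minimize the convex bowl f(x, y) = x^2 + y^2 (illustrative).
f = lambda x: float(x @ x)
grad_f = lambda x: 2.0 * x
print(gradient_descent(f, grad_f, x0=[3.0, -4.0]))   # converges near [0, 0]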
Figure 1: Gradient Descent algorithm
Many machine learning models utilize the GD algorithm to select and tune parameters. For instance, regression models use GD to find the best-fit intercept and slope for each predictor variable. Similarly, the back-propagation algorithm uses GD in artificial neural networks and deep learning models. Updates to the GD algorithm have adapted it to today's complex requirements, such as learning from big data and fast computation. Big data demands mini-batch gradient descent, a version of GD which computes each gradient step from a small sample batch of the training data. Mini-batch GD works through a predefined set of batches in each epoch, which helps the back-propagation algorithm converge to good weights and reach the best possible minimum value. Stochastic gradient descent (SGD) is used when fast computation is required: the training set is randomized, and in every iteration only one training data point is used to estimate the gradient of the cost function. Thus, SGD may not converge exactly to the minimum, but wanders around a region close to it.
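To contrast SGD with full-batch GD, the toy sketch below fits a line y ≈ wx + b by updating from one randomly chosen training point at a time; the synthetic data, squared-error loss, and learning rate are all invented for illustration.

import numpy as np

rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=200)                 # synthetic inputs
y = 2.0 * X + 1.0 + 0.1 * rng.normal(size=200)   # noisy line y = 2x + 1

w, b, alpha = 0.0, 0.0, 0.05
for epoch in range(20):
    for i in rng.permutation(len(X)):    # shuffle the training set each epoch
        err = (w * X[i] + b) - y[i]      # gradient comes from a single point
        w -= alpha * err * X[i]
        b -= alpha * err
print(w, b)                              # wanders close to w = 2, b = 1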
Second, we move on to the random number generator proposed by Lehmer in 1951 and known as the Linear Congruential Generator (LCG). LCG is used to generate independent and uniformly distributed random numbers in a long sequence without repetition. The pseudo-code is shown below in Algorithm 2. Each new number is the residue, modulo m, of a linear function of the previous number, which yields good randomness properties when the parameters are chosen well. The algorithm requires only a seed and a few initial conditions to decide the parameters a, c, and m.
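A minimal LCG sketch in Python follows; the parameters a, c, and m shown here are one widely published choice (the Numerical Recipes constants), not the only valid ones.

def lcg(seed, a=1664525, c=1013904223, m=2**32):
    # Linear congruential recurrence: X_{n+1} = (a*X_n + c) mod m
    x = seed
    while True:
        x = (a * x + c) % m
        yield x / m          # scale the residue to a uniform value in [0, 1)

gen = lcg(seed=42)
print([round(next(gen), 4) for _ in range(5)])   # five pseudo-random numbers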
LCG is faster than other random number generators and requires less memory to store its state. This speed and small footprint made LCG a common building block in pseudo-random number generation, although plain LCGs are predictable and therefore not suitable as cryptographically secure generators. The fields of computing and random processes use more advanced generators in the same spirit, such as the Mersenne Twister (MT19937). This generator, developed by Matsumoto and Nishimura in 1997, produces one of the longest random number sequences, 2^19937 − 1 numbers before repeating. Many machine learning applications use such modern generators to simulate the future states of random processes in technology, genetic engineering, and biomedical engineering.
Third, we discuss one of the most important contributions in mathematical optimization and computer programming: the recursive approach known as Dynamic Programming (DP), proposed by Richard Bellman in the 1950s. DP decomposes a master problem into smaller, easier-to-solve sub-problems, which are solved sequentially until the solution to the master problem is recovered. The sub-problems are nested recursively, giving the master problem an optimal substructure: the sub-problems' solutions compose the solution to the master problem. Some applications of this classic sequential decision-making framework include inventory management, database access, flight control, routing, and RNA structure prediction. For example, the Bellman equation for a general recursive minimization problem is shown in Equation 1:
θt(X) = min_a { C(X, a) + γ Σ_X' P(X' | X, a) θt+1(X') }        (1)

where C(X, a) is the immediate reward or cost for the state-action pair (X, a), and θt(X) is the optimal expected cost when starting from state X in period t, calculated as the sum of the immediate cost of (X, a) and the discounted expected future cost over all successor states X' given (X, a). γ is the discounting factor, such that the value of $1 in period t+1 is equal to γ in period t.
There are many DP extensions in optimization applications: value iteration in Markov decision processes, the Hamilton-Jacobi-Bellman equation for solving partial differential equations, and optimal control theory. In supply chain theory, DP has been applied both in the Bellman-Ford algorithm for finding the shortest distance between two nodes and in backward induction for finite-horizon, discrete, multi-stage or multi-period problems. Many stochastic optimization problems use DP as a benchmark solution method because DP assumes full knowledge of the random variables' underlying distributions. For instance, Clark and Scarf used DP to prove the optimality of base-stock policies for serial supply chain systems, which led to significant breakthroughs in finding optimal global solutions for various supply chain topologies. The stochastic version of DP gives an exact optimum for stochastic problems, unlike stochastic programming, chance-constrained linear programs, and robust optimization techniques. However, due to the curse of dimensionality, the state space grows exponentially with the problem size, and DP fails to solve large problems in reasonable run time. Hence, faster algorithms imitating the DP concept have been developed: computer scientists use algorithms like reinforcement learning, while operations researchers use approximate dynamic programming (ADP), which is widely used in stochastic optimization, deep reinforcement learning, and artificial intelligence applications.
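As a small worked example of Equation 1, the sketch below runs backward induction on a made-up finite-horizon problem with three states and two actions; all costs, transition probabilities, and the horizon are invented for illustration.

import numpy as np

T, gamma = 5, 0.9                        # horizon and discount factor
C = np.array([[1.0, 2.0],                # C[x, a]: immediate cost of action a
              [0.5, 1.5],                # in state x (illustrative numbers)
              [2.0, 0.2]])
P = np.array([                           # P[a, x, x']: transition probability
    [[0.8, 0.1, 0.1], [0.2, 0.6, 0.2], [0.1, 0.3, 0.6]],
    [[0.5, 0.4, 0.1], [0.1, 0.8, 0.1], [0.3, 0.3, 0.4]],
])

theta = np.zeros(3)                      # terminal condition theta_T(X) = 0
for t in reversed(range(T)):             # work backward from period T-1 to 0
    # Q[x, a] = C(X, a) + gamma * sum over X' of P(X'|X, a) * theta_{t+1}(X')
    Q = C + gamma * np.einsum('axy,y->xa', P, theta)
    theta = Q.min(axis=1)                # Bellman minimization over actions
print(theta)                             # optimal expected costs theta_0(X)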
Fourth on our list is the Fast Fourier Transform (FFT). Tukey and Cooley proposed the FFT in 1965 as an advancement of the Fourier Transform (FT), developed by Joseph Fourier in the early 1800s. In essence, FT is used to reconstruct the sine waves that make up a combined wave form. Joseph Fourier showed that some functions can be expressed as an infinite sum of harmonics. Since the Fourier series represents signals in terms of sines and cosines with given frequencies and phases, an oscillating wave can be decomposed to select the signals of choice. In other words, any signal can be decomposed into sine waves based on their frequency (pitch) and amplitude (volume). The basic FT equation for a function f(x) is given here:
f̂(ξ) = ∫ f(x) e^(−2πixξ) dx

where e^(−2πixξ) = cos(2πxξ) − i sin(2πxξ), based on Euler's formula. Unfortunately, FT fails to process signals in a reasonable time. For instance, if a signal has 1,000,000 data points, FT takes 20 hours, which is not practically feasible for real-world applications. To speed up computation, John Tukey and James Cooley invented a faster version of FT in order to detect nuclear tests by the USSR during the Cold War. If a signal has 1,000,000 data points, FFT solves it in 0.05 seconds, an impressive improvement over 20 hours. The pseudo-code for FFT is shown below in Algorithm 3. Today, FFTs are widely used in engineering, music, science, and mathematics, and the FFT was included in the 'Top 10 Algorithms of the 20th Century' by the IEEE magazine Computing in Science & Engineering.
Figure 2: FFT algorithm
To illustrate the FFT algorithm, the discrete version is given below. The discrete Fourier transform (DFT) of f at frequency k is

f̂(k) = Σ_{n=0}^{N−1} f(n) e^(−2πikn/N),    k = 0, 1, ..., N−1,

where i = √(−1). The basic idea is to divide and conquer recursively, breaking the DFT down into smaller DFTs for any composite length N = pq. The smaller DFTs are combined using twiddle factors (complex roots of unity). If N has the prime factorization N = p1 p2 ... pm, then the FFT requires only N · Σi pi operations.
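The divide-and-conquer structure and the twiddle factors can be seen in the compact radix-2 sketch below, which assumes the input length is a power of two (the simplest case N = 2q).

import cmath

def fft(f):
    # Radix-2 Cooley-Tukey FFT; len(f) must be a power of two.
    N = len(f)
    if N == 1:
        return list(f)
    even, odd = fft(f[0::2]), fft(f[1::2])        # two half-size DFTs
    # Twiddle factors: complex roots of unity e^(-2*pi*i*k/N)
    tw = [cmath.exp(-2j * cmath.pi * k / N) * odd[k] for k in range(N // 2)]
    return ([even[k] + tw[k] for k in range(N // 2)] +
            [even[k] - tw[k] for k in range(N // 2)])

print([round(abs(v), 3) for v in fft([1, 2, 3, 4, 5, 6, 7, 8])])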
The final algorithm topping our list has revolutionized operations research: the Simplex Algorithm, developed by George B. Dantzig during World War II. Systems of linear inequalities were first studied by Motzkin in his Ph.D. thesis in 1936, later by Kantorovich in 1939 in economic analysis, and then by Hitchcock in 1941. However, Dantzig was the first person to solve a complete linear program (LP), with linear constraints and a linear objective function [3]. Koopmans and Dantzig formulated the initial LP problem with a maximization objective and named the algorithm "climbing up the bean pole" [3]. Initially, the US military used the Simplex Algorithm for planning and transportation problems. Now, however, the algorithm is used in a vast range of optimization problems, including airline scheduling, production planning, vehicle routing, and supply chain network design.
Linear inequalities define half-spaces bounded by hyperplanes, and their intersection forms a convex polyhedron (Figure 3). As a result, an optimal solution lies at a corner point, one of the extreme points of the polyhedron. An effective algorithm therefore traverses the extreme points of the polyhedron quickly, improving the objective at each step. For two variables, an LP problem can be solved easily using the graphical method.
Figure 3: Corner points of the polyhedron
The linear programming objective is to find z = min c^T x subject to Ax = b and x ≥ 0, where x = (x1, ..., xn), A is an m × n matrix, and b and c are column vectors. The pseudo-code for the simplex algorithm is shown in Algorithm 4.
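For readers who want to experiment, a small LP in the standard form above can be solved with SciPy's linprog routine; the data below are invented, and recent SciPy versions dispatch to the HiGHS solvers rather than a textbook simplex tableau.

import numpy as np
from scipy.optimize import linprog

# min c^T x  subject to  Ax = b, x >= 0 (all numbers illustrative)
c = np.array([1.0, 2.0, 0.0])
A = np.array([[1.0, 1.0, 1.0],
              [2.0, 0.5, 0.0]])
b = np.array([4.0, 3.0])

res = linprog(c, A_eq=A, b_eq=b, bounds=[(0, None)] * 3)
print(res.x, res.fun)    # optimal extreme point and objective value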
Several extensions of the simplex method, such as Dantzig-Wolfe decomposition and Benders decomposition, have been developed to reduce run time and solve large-scale LP problems. High-performing cutting-plane and branch-and-bound algorithms were developed to find integer solutions, essential for real-life applications; solving integer programs requires a fast LP algorithm at the core. In 1972, Klee and Minty presented an LP instance on which the simplex corner-point search visits every extreme point, proving that the Simplex algorithm has worst-case exponential run time. Several authors then developed algorithms to solve LP problems in polynomial run time: Leonid Khachiyan proved that the ellipsoid method runs in polynomial time, while Karmarkar's breakthrough interior-point algorithm made it practical to solve large LPs in polynomial run time. As a result of all these improvements, complex mixed-integer linear programs (MIPs) with millions of variables and constraints are solved today using commercial software in reasonable run times. For example, IBM ILOG CPLEX shortened its run time by a factor of 30,000 from 1991 to 2007, and Gurobi claims that its solver is the world's fastest for LP and MIP problems.
The discovery and improvement of algorithms is a continuous process. Several commercial software companies are pushing their R&D scientists to implement advanced machine learning techniques like neural networks and deep learning, coupled with cloud computing and parallel computing on advanced hardware and computer architectures. We cannot wait to see what algorithms such researchers will create to change the world. In short, we can expect breakthrough algorithms in the future that will solve the most complicated and complex problems in shorter run times.
References:
[1] John Brillhart. Derrick Henry Lehmer. Acta Arithmetica, LXII.3, 1992.
[2] James W. Cooley and John W. Tukey. An algorithm for the machine calculation of complex Fourier series. Mathematics of Computation, 19(90):297–301, 1965.
[3] George B. Dantzig. Origins of the simplex method. Technical Report SOL 87-5, Systems Optimization Laboratory, Department of Operations Research, Stanford University, 1987.
[4] David G. Luenberger and Yinyu Ye. Linear and Nonlinear Programming, volume 2. Springer, 1984.