Bayesian Optimization: How to calibrate the hyperparameters of computationally expensive models?
A good choice of hyperparameters for a machine learning model can be the difference between a very accurate model and a poorly performing one. Consequently, a lot of effort has gone into understanding how to find the optimal hyperparameters for a given scenario. Nowadays, the hyperparameter search is usually done with grid search or random search, but in some scenarios those strategies are infeasible. For instance, suppose you want to find the best learning rate to train a graph neural network over a set of 200 million nodes to detect fraudulent transactions. You know that training the network takes roughly 11 hours on a powerful GPU, and also that the learning rate is likely to lie in the range between 0.0000001 and 0.01. As a first attempt, you could split the search interval into 30 learning rate candidates (i.e. [0.01, 0.0048, … , 0.00173, 0.0000001]), train the model 30 times (once for each learning rate) and choose the learning rate that gives the best performance. However, you quickly notice that this is going to take ages: to be precise, 330 hours, or 13.75 days. Then you realise you could try random search and train the model only 15 times, but that would still take roughly 7 days of experiments on an expensive GPU. What could we do to find the best learning rate then?
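To make the arithmetic concrete, here is a minimal sketch of the cost estimate above, assuming the usual log-spaced grid for learning rates:

```python
import numpy as np

HOURS_PER_RUN = 11  # rough cost of one full training run, from the text

# 30 log-spaced learning-rate candidates between 1e-7 and 1e-2
candidates = np.logspace(-7, -2, num=30)

total_hours = len(candidates) * HOURS_PER_RUN
print(f"Grid search: {total_hours} GPU hours (~{total_hours / 24:.2f} days)")
# Grid search: 330 GPU hours (~13.75 days)
```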
One excellent answer to that question is Bayesian Optimization (BO). The goal of BO is to reframe the original problem so that we can perform the optimisation over a cheap-to-evaluate surrogate model. Let's recap a bit so the previous sentence is fully clear. First, recall that in practice what we have is a set of inputs (i.e. candidate learning rates) that are fed into an expensive black-box operation that outputs a particular value for each input (i.e. the cross-validation error for each learning rate). Notice that this objective function is challenging to optimise not only because it is expensive, but also because many assumptions used in traditional optimisation approaches do not hold in this scenario. For instance, the function may not be convex, and the problem is derivative-free, which means we do not have access to its gradients (i.e. gradient-based optimisation is not feasible).

To overcome these issues, BO proposes to optimise a cheap probabilistic proxy model that approximates the original objective function. The most common proxy used in BO is the Gaussian Process (GP), which induces a distribution over the functions that could describe the process. Observe that by using a GP as the surrogate model we can incorporate the available knowledge (i.e. the (learning rate, error) pairs observed so far) into our modelling. In that regard, BO starts with a prior belief over the functions that describe the relationship between the inputs (i.e. learning rate, margin parameter, batch size, learning rate scheduler, etc.) and the output (i.e. a cross-validated measure of hyperparameter quality). Then, this initial belief is sequentially refined via a Bayesian posterior update each time a new point (i.e. learning rate) proposed by BO is evaluated. Finally, using this updated probabilistic model, an acquisition function scores the candidate points and determines the best point to evaluate on the true objective next, which is potentially the optimum of our optimisation problem.
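To make the surrogate step concrete, the sketch below fits a GP to a handful of hypothetical (learning rate, error) pairs with scikit-learn; the observations are invented for illustration, and we work with log10 of the learning rate, a common choice for this kind of search:

```python
import numpy as np
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

# Hypothetical evaluations of the true objective: log10(learning rate) -> CV error
X_observed = np.array([[-7.0], [-5.0], [-3.5], [-2.0]])
y_observed = np.array([0.42, 0.31, 0.18, 0.25])

# The GP encodes our prior belief over functions; Matern is a common kernel in BO
gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
gp.fit(X_observed, y_observed)  # Bayesian posterior update given the evidence

# The posterior yields a mean prediction and an uncertainty for any candidate
X_candidates = np.linspace(-7, -2, 100).reshape(-1, 1)
mu, sigma = gp.predict(X_candidates, return_std=True)
```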
At this point of the post, you might be thinking that the core of BO is the acquisition function. To some extent, you are right! Devising a good acquisition function is essential to finding the optimal value efficiently. The challenge is that a good acquisition function should appropriately trade off exploration and exploitation; in other words, it should consider exploring points with high uncertainty while also exploiting the regions around points with low error. This is not trivial to do, and for that reason the design of better acquisition functions is still an active area of research (the same exploration-exploitation dilemma is also a central theme in Reinforcement Learning). However, there are four acquisition functions that work pretty well for hyperparameter tuning: (1) Probability of Improvement and (2) Expected Improvement, both of which can introduce a slack variable that controls how aggressive the exploration is. We could also use (3) the Lower Confidence Bound, which combines the predictive mean and variance of the GP to directly target the desired exploration of the space, or (4) Entropy Search, which picks the point expected to be most informative about the location of the optimum. A sketch of Expected Improvement is shown below.
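As an illustration, here is a minimal sketch of Expected Improvement for our minimisation setting, assuming we already have the GP posterior mean `mu` and standard deviation `sigma` at each candidate (the slack variable `xi` is the exploration knob mentioned above):

```python
import numpy as np
from scipy.stats import norm

def expected_improvement(mu, sigma, f_best, xi=0.01):
    """EI for minimisation: how much, in expectation, a candidate improves
    on the best error observed so far. Larger xi -> more exploration."""
    sigma = np.maximum(sigma, 1e-12)   # guard against zero predictive std
    improvement = f_best - mu - xi     # expected margin over the incumbent
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)
```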
In summary, BO is a framework that takes advantage of the history of the optimisation process to make the search for the optimum efficient. It does so by fitting a probabilistic proxy model that captures our beliefs about the behaviour of the unknown objective and uses its posterior uncertainty to guide the exploration. The acquisition function scores the candidates and suggests the next point to evaluate on the true objective (i.e. the large graph neural network). After observing the output value (i.e. the error obtained with the suggested learning rate), the posterior of the probabilistic proxy model is updated and the process is repeated as many times as needed. Remarkably, the number of times we need to query the true objective is typically far smaller than what grid search or random search require!
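Putting the pieces together, here is a minimal, self-contained sketch of the whole loop; the `true_objective` below is a cheap toy stand-in for the expensive training run, so the numbers are purely illustrative:

```python
import numpy as np
from scipy.stats import norm
from sklearn.gaussian_process import GaussianProcessRegressor
from sklearn.gaussian_process.kernels import Matern

def true_objective(log_lr):
    # Toy stand-in for "train the GNN and return the CV error";
    # its minimum sits at log10(lr) = -4, i.e. lr = 1e-4
    return (log_lr + 4.0) ** 2 / 10 + 0.15

def expected_improvement(mu, sigma, f_best, xi=0.01):
    sigma = np.maximum(sigma, 1e-12)
    improvement = f_best - mu - xi
    z = improvement / sigma
    return improvement * norm.cdf(z) + sigma * norm.pdf(z)

rng = np.random.default_rng(0)
X = rng.uniform(-7, -2, size=(3, 1))   # 3 random initial evaluations (log-space)
y = true_objective(X).ravel()

candidates = np.linspace(-7, -2, 500).reshape(-1, 1)
for _ in range(10):                    # 10 BO iterations -> 13 training runs total
    gp = GaussianProcessRegressor(kernel=Matern(nu=2.5), normalize_y=True)
    gp.fit(X, y)                       # posterior update with all evidence so far
    mu, sigma = gp.predict(candidates, return_std=True)
    x_next = candidates[np.argmax(expected_improvement(mu, sigma, y.min()))]
    X = np.vstack([X, [x_next]])       # query the expensive true objective...
    y = np.append(y, true_objective(x_next))  # ...and record the new evidence

best = np.argmin(y)
print(f"Best log10(lr): {X[best, 0]:.3f} with error {y[best]:.4f}")
```

Note that the search converges with only 13 queries of the objective, against the 30 that grid search would spend on the same interval.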
As a final note, you might wonder what we have actually gained by using BO, given that we still have to run a global optimiser inside the BO loop. The answer is that evaluating and optimising the acquisition function is cheap compared to evaluating the true objective. For instance, at Rappi we could think about finding the optimal hyperparameters of expensive models such as a GNN that detects transactional fraud or a reinforcement learning agent that mimics the payment behaviour of users worldwide. Furthermore, BO is not restricted to hyperparameter tuning; it can also be used in experimentation, particularly in scenarios where we cannot afford to evaluate the true objective many times. For example, at Rappi we are interested in determining the service fee that minimises churn, but obviously we cannot put our users at risk by running an experiment over a large search space in the fashion of grid search or random search. In those scenarios, BO has been shown to bring significant improvements, as it narrows down the region where the global optimum is located at a lower cost than traditional approaches.
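If you want to try this without writing the loop yourself, libraries such as scikit-optimize implement it end to end. Below is a sketch, where `train_and_evaluate` is a hypothetical stand-in for your expensive training routine (here replaced by a cheap toy function so the snippet runs):

```python
import math
from skopt import gp_minimize
from skopt.space import Real

def train_and_evaluate(params):
    # Hypothetical stand-in: in reality, train the GNN with params[0]
    # as the learning rate and return the cross-validated error
    return (math.log10(params[0]) + 4.0) ** 2 / 10 + 0.15

result = gp_minimize(
    func=train_and_evaluate,
    dimensions=[Real(1e-7, 1e-2, prior="log-uniform", name="learning_rate")],
    acq_func="EI",    # Expected Improvement
    n_calls=15,       # only 15 expensive evaluations in total
    random_state=0,
)
print(result.x, result.fun)  # best learning rate found and its error
```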
Finally, after reading the blog post, could you say why it is called Bayesian Optimization? Leave your thoughts in the comments.
Special thanks to Ana M. Quintero Ossa for drawing such great illustrations.