Hyperparameter Tuning with Ray Tune
What is hyperparameter tuning?
In machine learning, hyperparameter optimization or tuning is the problem of choosing a set of optimal hyperparameters for a learning algorithm. A hyperparameter is a parameter whose value is used to control the learning process.
How does hyperparameter tuning work?
Hyperparameter (HP) tuning works by running multiple trials in a single training job.
- Each trial is a complete execution of your training application with values for your chosen hyperparameters, set within limits you specify.
- The tuning service keeps track of the results of each trial and uses them to choose values for subsequent trials.
- When the job is finished, you can get a summary of all the trials along with the most effective configuration of values according to the criteria you specify.
Hyperparameter tuning optimizes a single target variable, also called the hyperparameter metric, that you specify.
- The metric must be a numeric value.
- You can specify whether you want to tune your model to maximize or minimize your metric.
What is Ray Tune?
Ray is an open-source unified compute framework that makes it easy to scale AI and Python workloads — from reinforcement learning to deep learning, to tuning, and model serving. Ray Tune is a powerful, flexible, and popular open-source Python package (part of the Ray suite of tools) for hyperparameter tuning.
Tune has search algorithms that integrate with many popular optimization libraries, such as Nevergrad, HyperOpt, or Optuna. Tune automatically converts the provided search space into the search spaces the search algorithms and underlying libraries expect.
Ray Tune provides native integration with MLflow.
In summary, here are some unique advantages of using Ray Tune:
- Advanced Optimization Algorithms
- Minimal or no code restructuring
- Distributed Training
- MLflow Integration
- Early Stopping
You can find a summary of most of the hyperparameter tuning search algorithms supported by Ray Tune, with examples, at:
Ray Tune search implementation examples
Refer to the following sections to find implementation examples taken from Ray Tune documentation:
Grid & random search
Implementation example:
What is grid search?
In grid search:
- The search space of each hyperparameter is discretized.
- The algorithm then trains a model for every possible combination of hyperparameter values.
- At the end of the process, it selects the configuration that provides the best performance.
This method is highly parallelizable, assuming the availability of sufficient computing power to train multiple models simultaneously. However, it suffers from the curse of dimensionality: the number of configurations grows exponentially with the number of hyperparameters, making the search increasingly expensive as complexity increases.
What is random search?
Random search is:
- A variation of the grid search algorithm.
- It explores the hyperparameter space by randomly sampling combinations, rather than using a fixed Cartesian grid.
- The algorithm doesn’t explore all possibilities exhaustively; instead, it runs for a specified number of trials or within a fixed time budget.
- Like grid search, it can still suffer from the curse of dimensionality if a high sampling density is required across many parameters.
A key advantage of random search is that it can more efficiently explore the search space when hyperparameters are weakly correlated. This often leads to better results in less time, as it has a higher chance of sampling near-optimal values across independent dimensions.
Bayesian optimization search
Implementation example:
What is Bayesian optimization search?
Bayesian optimization:
- Starts by trying out a few different sets of hyperparameter values.
- It evaluates how well each set performs on the training data and uses that information to make educated guesses about which hyperparameters might work best.
- Based on these initial results, Bayesian optimization selects the next set of hyperparameters to try, taking into account both the best-performing sets and their associated performance.
It uses a probabilistic surrogate model (most commonly a Gaussian process) to estimate which hyperparameters are likely to improve the model's performance the most. This process continues for several iterations, with the algorithm gradually narrowing down the range of hyperparameter values that are likely to yield the best results. By learning from previous evaluations, it becomes smarter in selecting hyperparameter combinations that have a higher chance of success.
- Classic Gaussian-process-based Bayesian optimization works naturally only on continuous hyperparameters; categorical ones require workarounds.
- Gaussian processes scale poorly as the number of evaluations grows. After a few thousand samples, fitting the surrogate model itself can become a significant cost.
Tree Parzen Estimators (TPE) - HyperOpt search
Implementation example:
What is Tree Parzen Estimators (TPE) - HyperOpt search?
The TPE algorithm also uses a trial-and-error approach, but with a different strategy compared to traditional Bayesian optimization:
- TPE divides the hyperparameter values observed so far into two groups, based on the objective: the "good" values and the "bad" values.
- It models each group with a probability density function (PDF) and proposes new candidates that are likely under the "good" PDF and unlikely under the "bad" one.
- By iteratively sampling hyperparameters from the PDFs and updating them based on performance, TPE focuses its search on regions of the hyperparameter space that are more likely to yield better results.
Pros:
- Efficient Exploration: the ratio of the two PDFs naturally balances exploration of uncertain regions with exploitation of promising ones.
- Converges Quickly: TPE tends to converge to a good set of hyperparameter values relatively quickly.
- Handles High-Dimensional Spaces: TPE performs well in high-dimensional hyperparameter spaces. It effectively adapts its sampling strategy and avoids getting stuck in local optima, which is particularly beneficial when dealing with complex models that have numerous hyperparameters.
- Requires Fewer Evaluations: Compared to traditional grid search or random search methods, TPE often requires fewer evaluations — and thus less compute — to find good hyperparameter settings.
Cons:
- Initialization Sensitivity: The initial set of hyperparameter settings can influence the performance of TPE. If the initial settings are poorly chosen, it might take longer for the algorithm to converge to optimal values.
- Lack of Theoretical Guarantees: TPE is a heuristic algorithm and lacks strong theoretical guarantees. While it has been empirically proven to work well in many scenarios, there is no guarantee that it will always find the global optimum.
- Sensitivity to Noisy Evaluations: If the evaluation of hyperparameter settings is noisy or inconsistent, TPE might struggle to accurately model the distribution of "good" and "bad" values. In such cases, the algorithm may make suboptimal choices and take longer to converge.
Resource optimization (CPU/GPU/RAM)
Example of how resource limits can be specified for each trial during a hyperparameter tuning run in Ray Tune:
MLflow integration
Example of how to use MLflow with Ray Tune: